第14章 爬蟲與自動化¶
網絡爬蟲是現代數據獲取和自動化處理的重要技術手段,通過模擬瀏覽器行爲自動訪問網頁並提取所需信息。本章將從基礎概念開始,逐步深入到高級爬蟲框架和自動化技術,幫助讀者掌握完整的爬蟲開發技能。
14.1 網絡爬蟲基礎¶
爬蟲概述¶
網絡爬蟲的定義和用途¶
網絡爬蟲(Web Crawler),也稱爲網頁蜘蛛(Web Spider)或網絡機器人(Web Robot),是一種按照一定規則自動瀏覽萬維網並獲取信息的程序。爬蟲的主要用途包括:
- 數據採集:從網站獲取商品信息、新聞資訊、股票價格等
- 搜索引擎:爲搜索引擎建立索引數據庫
- 市場分析:收集競爭對手信息,進行市場調研
- 內容監控:監控網站內容變化,及時獲取更新
- 學術研究:收集研究數據,進行數據分析
爬蟲的工作原理¶
網絡爬蟲的基本工作流程如下:
- 發送HTTP請求:向目標網站發送請求
- 接收響應數據:獲取服務器返回的HTML頁面
- 解析頁面內容:提取所需的數據信息
- 存儲數據:將提取的數據保存到文件或數據庫
- 發現新鏈接:從當前頁面中發現新的URL
- 重複過程:對新發現的URL重複上述過程
讓我們通過一個簡單的示例來理解爬蟲的基本原理:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin
def simple_crawler(url):
"""
簡單的網頁爬蟲示例
"""
try:
# 1. 發送HTTP請求
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
# 2. 檢查響應狀態
if response.status_code == 200:
# 3. 解析頁面內容
soup = BeautifulSoup(response.text, 'html.parser')
# 4. 提取標題
title = soup.find('title')
if title:
print(f"頁面標題: {title.get_text().strip()}")
# 5. 提取所有鏈接
links = soup.find_all('a', href=True)
print(f"找到 {len(links)} 個鏈接:")
for i, link in enumerate(links[:5]): # 只顯示前5個鏈接
href = "https://yeyupiaoling.cn" + link['href']
text = link.get_text().strip()
print(f"{i+1}. {text} -> {href}")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"爬取過程中出現錯誤: {e}")
# 使用示例
if __name__ == "__main__":
url = "https://yeyupiaoling.cn"
simple_crawler(url)
運行上述代碼,輸出類似如下:
頁面標題: 夜雨飄零的博客 - 首頁
找到 50 個鏈接:
1. -> https://yeyupiaoling.cn/
2. 夜雨飄零 -> https://yeyupiaoling.cn/
3. 首頁 -> https://yeyupiaoling.cn/
4. 歸檔 -> https://yeyupiaoling.cn/archive
5. 標籤 -> https://yeyupiaoling.cn/tag
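上面的示例只抓取了單個頁面。要體現基本流程中第5步(發現新鏈接)和第6步(重複過程),還需要維護一個待抓取隊列和已訪問集合。下面是一個簡要的廣度優先爬取草圖(僅作示意:起始URL、頁數上限和抓取間隔都是示例值,實際使用前請先確認目標網站允許抓取):
import time
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup
def bfs_crawler(start_url, max_pages=10, delay=1.0):
    """按廣度優先順序爬取同一站點的頁面(示意)"""
    visited = set()             # 已訪問的URL,避免重複抓取
    queue = deque([start_url])  # 待抓取的URL隊列
    domain = urlparse(start_url).netloc
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException as e:
            print(f"請求失敗: {url} ({e})")
            continue
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.find('title')
        print(f"[{len(visited)}] {url} - {title.get_text().strip() if title else '無標題'}")
        # 發現新鏈接:只保留同一域名下的URL並加入隊列
        for link in soup.find_all('a', href=True):
            new_url = urljoin(url, link['href']).split('#')[0]
            if urlparse(new_url).netloc == domain and new_url not in visited:
                queue.append(new_url)
        time.sleep(delay)  # 控制爬取頻率,避免給服務器造成壓力
if __name__ == "__main__":
    bfs_crawler("https://yeyupiaoling.cn", max_pages=5)
實際項目中還應加入robots.txt檢查、URL去重持久化和異常重試等處理,後文會逐步介紹。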
爬蟲的分類和特點¶
根據不同的分類標準,爬蟲可以分爲以下幾類:
按照爬取範圍分類:
- 通用爬蟲:搜索引擎使用的爬蟲,爬取整個互聯網
- 聚焦爬蟲:針對特定主題或網站的爬蟲
- 增量爬蟲:只爬取新增或更新的內容(實現思路見下文的示意代碼)
按照技術實現分類:
- 靜態爬蟲:只能處理靜態HTML頁面
- 動態爬蟲:能夠處理JavaScript渲染的動態頁面
按照爬取深度分類:
- 淺層爬蟲:只爬取首頁或少數幾層頁面
- 深層爬蟲:能夠深入爬取網站的多層結構
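以增量爬蟲爲例,其核心是記住每個URL上次抓取時的內容指紋,下次只處理新增或發生變化的頁面。下面是一個簡要草圖(假設用本地JSON文件fingerprints.json保存URL到內容MD5的映射,文件名與示例URL僅作演示):
import hashlib
import json
import os
import requests
FINGERPRINT_FILE = 'fingerprints.json'  # 保存URL -> 內容指紋的本地文件(示例路徑)
def load_fingerprints():
    """讀取上一次抓取時保存的指紋記錄"""
    if os.path.exists(FINGERPRINT_FILE):
        with open(FINGERPRINT_FILE, 'r', encoding='utf-8') as f:
            return json.load(f)
    return {}
def save_fingerprints(fingerprints):
    with open(FINGERPRINT_FILE, 'w', encoding='utf-8') as f:
        json.dump(fingerprints, f, ensure_ascii=False, indent=2)
def incremental_fetch(urls):
    """只處理內容發生變化的頁面(示意)"""
    fingerprints = load_fingerprints()
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            continue
        # 用響應內容的MD5作爲指紋;也可以改用響應頭中的Last-Modified或ETag
        digest = hashlib.md5(response.content).hexdigest()
        if fingerprints.get(url) == digest:
            print(f"內容未變化,跳過: {url}")
            continue
        fingerprints[url] = digest
        print(f"新增或更新,需要處理: {url}")
        # 在這裏解析並存儲頁面數據...
    save_fingerprints(fingerprints)
if __name__ == "__main__":
    incremental_fetch(["https://httpbin.org/html"])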
爬蟲的法律和道德考量¶
在進行網絡爬蟲開發時,必須遵守相關的法律法規和道德準則:
- 遵守robots.txt協議:抓取前先檢查網站的robots.txt文件(檢查方法見下面的示例)
- 控制爬取頻率:避免對服務器造成過大壓力
- 尊重版權:不要爬取受版權保護的內容
- 保護隱私:不要爬取個人隱私信息
- 合理使用數據:僅將爬取的數據用於合法目的
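針對上面提到的robots.txt協議和爬取頻率控制,可以藉助標準庫urllib.robotparser在發送請求前做檢查。下面是一個簡要示例(User-Agent名稱和默認間隔均爲示例值):
import time
import urllib.robotparser
from urllib.parse import urlparse
import requests
def polite_fetch(url, user_agent='MySpider/1.0', default_delay=1.0):
    """先檢查robots.txt是否允許抓取,並按建議的間隔等待(示意)"""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # 下載並解析robots.txt
    if not rp.can_fetch(user_agent, url):
        print(f"robots.txt禁止抓取: {url}")
        return None
    # 優先使用robots.txt中聲明的Crawl-delay,否則使用默認間隔
    delay = rp.crawl_delay(user_agent) or default_delay
    response = requests.get(url, headers={'User-Agent': user_agent}, timeout=10)
    print(f"{url} -> {response.status_code},等待{delay}秒後再發送下一個請求")
    time.sleep(delay)
    return response
if __name__ == "__main__":
    polite_fetch("https://httpbin.org/html")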
HTTP協議基礎¶
HTTP請求和響應¶
HTTP(HyperText Transfer Protocol)是網絡爬蟲與Web服務器通信的基礎協議。理解HTTP協議對於開發高效的爬蟲至關重要。
HTTP通信包含兩個主要部分:
- 請求(Request):客戶端向服務器發送的消息
- 響應(Response):服務器返回給客戶端的消息
讓我們通過代碼來觀察HTTP請求和響應的詳細信息:
import requests
import json
def analyze_http_communication(url):
"""
分析HTTP請求和響應的詳細信息
"""
# 創建會話對象
session = requests.Session()
# 設置請求頭
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
}
try:
# 發送請求
response = session.get(url, headers=headers)
print("=== HTTP請求信息 ===")
print(f"請求URL: {response.request.url}")
print(f"請求方法: {response.request.method}")
print("請求頭:")
for key, value in response.request.headers.items():
print(f" {key}: {value}")
print("\n=== HTTP響應信息 ===")
print(f"狀態碼: {response.status_code}")
print(f"響應原因: {response.reason}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print("響應頭:")
for key, value in response.headers.items():
print(f" {key}: {value}")
print(f"\n響應內容長度: {len(response.text)} 字符")
print(f"響應內容類型: {response.headers.get('Content-Type', 'Unknown')}")
except requests.RequestException as e:
print(f"請求失敗: {e}")
# 使用示例
if __name__ == "__main__":
analyze_http_communication("https://yeyupiaoling.cn/")
運行結果示例:
=== HTTP請求信息 ===
請求URL: https://yeyupiaoling.cn/
請求方法: GET
請求頭:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: keep-alive
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
=== HTTP響應信息 ===
狀態碼: 200
響應原因: OK
響應時間: 0.197秒
響應頭:
Server: nginx/1.18.0 (Ubuntu)
Date: Sat, 16 Aug 2025 04:36:49 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Cookie
Content-Encoding: gzip
響應內容長度: 29107 字符
響應內容類型: text/html; charset=utf-8
Cookie和Session機制¶
Cookie和Session是Web應用中維持用戶狀態的重要機制:
- Cookie:存儲在客戶端的小型數據文件
- Session:存儲在服務器端的用戶會話信息
在爬蟲開發中,正確處理Cookie和Session對於模擬用戶登錄和維持會話狀態至關重要:
import requests
from http.cookies import SimpleCookie
def demonstrate_cookies_and_sessions():
"""
演示Cookie和Session的使用
"""
# 創建會話對象
session = requests.Session()
print("=== Cookie操作演示 ===")
# 1. 設置Cookie
cookie_url = "https://httpbin.org/cookies/set"
cookie_params = {
'username': 'testuser',
'session_id': 'abc123',
'preferences': 'dark_theme'
}
# 設置Cookie(這會導致重定向)
response = session.get(cookie_url, params=cookie_params)
print(f"設置Cookie後的狀態碼: {response.status_code}")
# 2. 查看當前Cookie
print("\n當前會話中的Cookie:")
for cookie in session.cookies:
print(f" {cookie.name} = {cookie.value}")
# 3. 發送帶Cookie的請求
cookie_test_url = "https://httpbin.org/cookies"
response = session.get(cookie_test_url)
if response.status_code == 200:
cookies_data = response.json()
print(f"\n服務器接收到的Cookie: {cookies_data.get('cookies', {})}")
# 4. 手動設置Cookie
print("\n=== 手動Cookie操作 ===")
manual_session = requests.Session()
# 方法1:通過字典設置
manual_session.cookies.update({
'user_id': '12345',
'auth_token': 'xyz789'
})
# 方法2:通過set方法設置
manual_session.cookies.set('language', 'zh-CN', domain='httpbin.org')
# 測試手動設置的Cookie
response = manual_session.get("https://httpbin.org/cookies")
if response.status_code == 200:
cookies_data = response.json()
print(f"手動設置的Cookie: {cookies_data.get('cookies', {})}")
# 5. Cookie持久化
print("\n=== Cookie持久化 ===")
# 保存Cookie到文件
import pickle
# 保存Cookie
with open('cookies.pkl', 'wb') as f:
pickle.dump(session.cookies, f)
print("Cookie已保存到文件")
# 加載Cookie
new_session = requests.Session()
try:
with open('cookies.pkl', 'rb') as f:
new_session.cookies = pickle.load(f)
print("Cookie已從文件加載")
# 測試加載的Cookie
response = new_session.get("https://httpbin.org/cookies")
if response.status_code == 200:
cookies_data = response.json()
print(f"加載的Cookie: {cookies_data.get('cookies', {})}")
except FileNotFoundError:
print("Cookie文件不存在")
# 模擬登錄示例
def simulate_login_with_session():
"""
模擬網站登錄過程
"""
print("\n=== 模擬登錄流程 ===")
session = requests.Session()
# 1. 訪問登錄頁面(獲取必要的Cookie和token)
login_page_url = "https://httpbin.org/cookies/set/csrf_token/abc123def456"
response = session.get(login_page_url)
print(f"訪問登錄頁面: {response.status_code}")
# 2. 提交登錄表單
login_data = {
'username': 'testuser',
'password': 'testpass',
'csrf_token': 'abc123def456'
}
login_url = "https://httpbin.org/post"
response = session.post(login_url, data=login_data)
if response.status_code == 200:
print("登錄請求發送成功")
response_data = response.json()
print(f"提交的登錄數據: {response_data.get('form', {})}")
# 3. 訪問需要登錄的頁面
protected_url = "https://httpbin.org/cookies"
response = session.get(protected_url)
if response.status_code == 200:
print("成功訪問受保護頁面")
cookies_data = response.json()
print(f"當前會話Cookie: {cookies_data.get('cookies', {})}")
# 運行演示
if __name__ == "__main__":
demonstrate_cookies_and_sessions()
simulate_login_with_session()
運行結果:
=== Cookie操作演示 ===
設置Cookie後的狀態碼: 200
當前會話中的Cookie:
username = testuser
session_id = abc123
preferences = dark_theme
服務器接收到的Cookie: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}
=== 手動Cookie操作 ===
手動設置的Cookie: {'user_id': '12345', 'auth_token': 'xyz789', 'language': 'zh-CN'}
=== Cookie持久化 ===
Cookie已保存到文件
Cookie已從文件加載
加載的Cookie: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}
=== 模擬登錄流程 ===
訪問登錄頁面: 200
登錄請求發送成功
提交的登錄數據: {'username': 'testuser', 'password': 'testpass', 'csrf_token': 'abc123def456'}
成功訪問受保護頁面
當前會話Cookie: {'csrf_token': 'abc123def456'}
網頁結構分析¶
HTML基礎結構¶
理解HTML結構是網頁數據提取的基礎。HTML(HyperText Markup Language)使用標籤來定義網頁內容的結構和語義。
一個典型的HTML頁面結構如下:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>頁面標題</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<header>
<nav>
<ul>
<li><a href="#home">首頁</a></li>
<li><a href="#about">關於</a></li>
</ul>
</nav>
</header>
<main>
<article>
<h1>文章標題</h1>
<p class="content">文章內容...</p>
</article>
</main>
<footer>
<p>© 2024 版權信息</p>
</footer>
<script src="script.js"></script>
</body>
</html>
讓我們編寫一個HTML結構分析工具:
import requests
from bs4 import BeautifulSoup, Doctype
from collections import Counter
def analyze_html_structure(url):
"""
分析網頁的HTML結構
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
print(f"=== HTML結構分析: {url} ===")
# 1. 基本信息
title = soup.find('title')
print(f"頁面標題: {title.get_text().strip() if title else '無標題'}")
# 2. 文檔類型和編碼
            doctype = next((item for item in soup.contents if isinstance(item, Doctype)), None)
            print(f"文檔類型: {doctype if doctype else '未檢測到DOCTYPE聲明'}")
charset_meta = soup.find('meta', attrs={'charset': True})
if not charset_meta:
charset_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
encoding = charset_meta.get('charset') if charset_meta else response.encoding
print(f"字符編碼: {encoding}")
# 3. 標籤統計
all_tags = [tag.name for tag in soup.find_all()]
tag_counter = Counter(all_tags)
print(f"\n標籤統計 (前10個):")
for tag, count in tag_counter.most_common(10):
print(f" {tag}: {count}個")
# 4. 鏈接分析
links = soup.find_all('a', href=True)
print(f"\n鏈接分析:")
print(f" 總鏈接數: {len(links)}")
internal_links = []
external_links = []
for link in links:
href = link['href']
if href.startswith('http'):
if url in href:
internal_links.append(href)
else:
external_links.append(href)
elif href.startswith('/'):
internal_links.append(href)
print(f" 內部鏈接: {len(internal_links)}個")
print(f" 外部鏈接: {len(external_links)}個")
# 5. 圖片分析
images = soup.find_all('img')
print(f"\n圖片分析:")
print(f" 圖片總數: {len(images)}")
img_with_alt = [img for img in images if img.get('alt')]
print(f" 有alt屬性: {len(img_with_alt)}個")
# 6. 表單分析
forms = soup.find_all('form')
print(f"\n表單分析:")
print(f" 表單總數: {len(forms)}")
for i, form in enumerate(forms):
method = form.get('method', 'GET').upper()
action = form.get('action', '當前頁面')
inputs = form.find_all(['input', 'select', 'textarea'])
print(f" 表單{i+1}: {method} -> {action} ({len(inputs)}個字段)")
# 7. 腳本和樣式
scripts = soup.find_all('script')
stylesheets = soup.find_all('link', rel='stylesheet')
print(f"\n資源分析:")
print(f" JavaScript文件: {len(scripts)}個")
print(f" CSS樣式表: {len(stylesheets)}個")
# 8. 結構層次
print(f"\n頁面結構:")
body = soup.find('body')
if body:
print_structure(body, level=0, max_level=3)
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"分析過程中出現錯誤: {e}")
def print_structure(element, level=0, max_level=3):
"""
遞歸打印HTML結構
"""
if level > max_level:
return
indent = " " * level
tag_name = element.name
# 獲取重要屬性
attrs = []
if element.get('id'):
attrs.append(f"id='{element['id']}'")
if element.get('class'):
classes = ' '.join(element['class'])
attrs.append(f"class='{classes}'")
attr_str = f" [{', '.join(attrs)}]" if attrs else ""
print(f"{indent}<{tag_name}>{attr_str}")
# 遞歸處理子元素
for child in element.children:
if hasattr(child, 'name') and child.name:
print_structure(child, level + 1, max_level)
# 使用示例
if __name__ == "__main__":
# 分析一個示例網頁
analyze_html_structure("https://httpbin.org/html")
運行結果示例:
=== HTML結構分析: https://httpbin.org/html ===
頁面標題: Herman Melville - Moby-Dick
文檔類型: html
字符編碼: utf-8
標籤統計 (前10個):
p: 4個
a: 3個
h1: 1個
body: 1個
html: 1個
head: 1個
title: 1個
鏈接分析:
總鏈接數: 3個
內部鏈接: 0個
外部鏈接: 3個
圖片分析:
圖片總數: 0個
有alt屬性: 0個
表單分析:
表單總數: 0個
資源分析:
JavaScript文件: 0個
CSS樣式表: 0個
頁面結構:
<body>
<h1>
<p>
<p>
<p>
<p>
CSS選擇器¶
CSS選擇器是定位HTML元素的強大工具,在網頁數據提取中起着關鍵作用。理解CSS選擇器語法對於精確定位目標元素至關重要。
基本選擇器:
- 標籤選擇器:div、p、a
- 類選擇器:.class-name
- ID選擇器:#element-id
- 屬性選擇器:[attribute="value"]
組合選擇器:
- 後代選擇器:div p(div內的所有p元素)
- 子元素選擇器:div > p(div的直接子p元素)
- 相鄰兄弟選擇器:h1 + p(緊跟h1的p元素)
- 通用兄弟選擇器:h1 ~ p(h1後的所有同級p元素)
僞類選擇器:
- :first-child、:last-child、:nth-child(n)
- :not(selector)、:contains(text)(注意::contains()並非標準CSS,源自jQuery;在BeautifulSoup使用的soupsieve中對應寫法爲 :-soup-contains(text))
讓我們通過實例來學習CSS選擇器的使用:
import requests
from bs4 import BeautifulSoup
def demonstrate_css_selectors():
"""
演示CSS選擇器的使用
"""
# 創建示例HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>CSS選擇器示例</title>
</head>
<body>
<div class="container">
<h1 id="main-title">新聞列表</h1>
<div class="news-section">
<article class="news-item featured">
<h2>重要新聞標題1</h2>
<p class="summary">這是新聞摘要...</p>
<span class="date">2024-01-15</span>
<a href="/news/1" class="read-more">閱讀更多</a>
</article>
<article class="news-item">
<h2>普通新聞標題2</h2>
<p class="summary">這是另一個新聞摘要...</p>
<span class="date">2024-01-14</span>
<a href="/news/2" class="read-more">閱讀更多</a>
</article>
<article class="news-item">
<h2>普通新聞標題3</h2>
<p class="summary">第三個新聞摘要...</p>
<span class="date">2024-01-13</span>
<a href="/news/3" class="read-more">閱讀更多</a>
</article>
</div>
<aside class="sidebar">
<h3>熱門標籤</h3>
<ul class="tag-list">
<li><a href="/tag/tech" data-category="technology">科技</a></li>
<li><a href="/tag/sports" data-category="sports">體育</a></li>
<li><a href="/tag/finance" data-category="finance">財經</a></li>
</ul>
</aside>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("=== CSS選擇器演示 ===")
# 1. 基本選擇器
print("\n1. 基本選擇器:")
# 標籤選擇器
h2_elements = soup.select('h2')
print(f"所有h2標籤 ({len(h2_elements)}個):")
for h2 in h2_elements:
print(f" - {h2.get_text().strip()}")
# 類選擇器
news_items = soup.select('.news-item')
print(f"\n所有新聞項 ({len(news_items)}個):")
for i, item in enumerate(news_items, 1):
title = item.select_one('h2').get_text().strip()
print(f" {i}. {title}")
# ID選擇器
main_title = soup.select_one('#main-title')
print(f"\n主標題: {main_title.get_text().strip()}")
# 屬性選擇器
tech_links = soup.select('a[data-category="technology"]')
print(f"\n科技類鏈接 ({len(tech_links)}個):")
for link in tech_links:
print(f" - {link.get_text().strip()} -> {link.get('href')}")
# 2. 組合選擇器
print("\n2. 組合選擇器:")
# 後代選擇器
container_links = soup.select('.container a')
print(f"容器內所有鏈接 ({len(container_links)}個):")
for link in container_links:
text = link.get_text().strip()
href = link.get('href', '#')
print(f" - {text} -> {href}")
# 子元素選擇器
direct_children = soup.select('.news-section > .news-item')
print(f"\n新聞區域的直接子元素 ({len(direct_children)}個)")
# 相鄰兄弟選擇器
after_h2 = soup.select('h2 + p')
print(f"\nh2後的相鄰p元素 ({len(after_h2)}個):")
for p in after_h2:
print(f" - {p.get_text().strip()[:30]}...")
# 3. 僞類選擇器
print("\n3. 僞類選擇器:")
# 第一個和最後一個子元素
first_news = soup.select('.news-item:first-child')
last_news = soup.select('.news-item:last-child')
if first_news:
first_title = first_news[0].select_one('h2').get_text().strip()
print(f"第一個新聞: {first_title}")
if last_news:
last_title = last_news[0].select_one('h2').get_text().strip()
print(f"最後一個新聞: {last_title}")
# nth-child選擇器
second_news = soup.select('.news-item:nth-child(2)')
if second_news:
second_title = second_news[0].select_one('h2').get_text().strip()
print(f"第二個新聞: {second_title}")
# 4. 複雜選擇器組合
print("\n4. 複雜選擇器:")
# 選擇特色新聞的標題
featured_title = soup.select('.news-item.featured h2')
if featured_title:
print(f"特色新聞標題: {featured_title[0].get_text().strip()}")
# 選擇包含特定文本的元素
read_more_links = soup.select('a.read-more')
print(f"'閱讀更多'鏈接 ({len(read_more_links)}個)")
# 選擇具有特定屬性的元素
category_links = soup.select('a[data-category]')
print(f"有分類屬性的鏈接 ({len(category_links)}個):")
for link in category_links:
category = link.get('data-category')
text = link.get_text().strip()
print(f" - {text} (分類: {category})")
# 實際網頁CSS選擇器應用
def extract_data_with_css_selectors(url):
"""
使用CSS選擇器從實際網頁提取數據
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
print(f"\n=== 從 {url} 提取數據 ===")
# 提取頁面標題
title = soup.select_one('title')
if title:
print(f"頁面標題: {title.get_text().strip()}")
# 提取所有鏈接
links = soup.select('a[href]')
print(f"\n找到 {len(links)} 個鏈接:")
for i, link in enumerate(links[:5], 1): # 只顯示前5個
text = link.get_text().strip()
href = link.get('href')
print(f" {i}. {text[:50]}... -> {href}")
# 提取所有段落文本
paragraphs = soup.select('p')
if paragraphs:
print(f"\n段落內容 (共{len(paragraphs)}個):")
for i, p in enumerate(paragraphs[:3], 1): # 只顯示前3個
text = p.get_text().strip()
if text:
print(f" {i}. {text[:100]}...")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
except Exception as e:
print(f"提取數據時出現錯誤: {e}")
# 運行演示
if __name__ == "__main__":
demonstrate_css_selectors()
extract_data_with_css_selectors("https://httpbin.org/html")
JavaScript和動態內容¶
現代網頁大量使用JavaScript來動態生成內容,這給傳統的靜態爬蟲帶來了挑戰。動態內容包括:
- AJAX加載的數據:通過異步請求獲取的內容
- JavaScript渲染的頁面:完全由JS生成的頁面結構
- 用戶交互觸發的內容:點擊、滾動等操作後顯示的內容
- 即時更新的數據:WebSocket或定時刷新的內容
處理動態內容的方法:
方法1:分析AJAX請求
import requests
import json
def analyze_ajax_requests():
"""
分析和模擬AJAX請求
"""
print("=== AJAX請求分析 ===")
# 模擬一個AJAX請求
ajax_url = "https://httpbin.org/json"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'X-Requested-With': 'XMLHttpRequest', # 標識AJAX請求
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Content-Type': 'application/json'
}
try:
response = requests.get(ajax_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"AJAX響應數據:")
print(json.dumps(data, indent=2, ensure_ascii=False))
else:
print(f"AJAX請求失敗: {response.status_code}")
except Exception as e:
print(f"AJAX請求異常: {e}")
# 運行AJAX分析
if __name__ == "__main__":
analyze_ajax_requests()
方法2:使用Selenium處理JavaScript
# 注意:需要安裝selenium和對應的瀏覽器驅動
# pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
def handle_dynamic_content_with_selenium():
"""
使用Selenium處理動態內容
"""
print("=== Selenium處理動態內容 ===")
# 配置Chrome選項
chrome_options = Options()
chrome_options.add_argument('--headless') # 無頭模式
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
try:
# 創建WebDriver實例
driver = webdriver.Chrome(options=chrome_options)
# 訪問包含動態內容的頁面
driver.get("https://httpbin.org/html")
# 等待頁面加載完成
wait = WebDriverWait(driver, 10)
# 獲取頁面標題
title = driver.title
print(f"頁面標題: {title}")
# 查找元素
h1_element = wait.until(
EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print(f"H1內容: {h1_element.text}")
# 獲取所有鏈接
links = driver.find_elements(By.TAG_NAME, "a")
print(f"\n找到 {len(links)} 個鏈接:")
for i, link in enumerate(links, 1):
text = link.text.strip()
href = link.get_attribute('href')
print(f" {i}. {text} -> {href}")
# 執行JavaScript
js_result = driver.execute_script("return document.title;")
print(f"\nJavaScript執行結果: {js_result}")
except Exception as e:
print(f"Selenium處理異常: {e}")
finally:
if 'driver' in locals():
driver.quit()
# 注意:實際運行需要對應的瀏覽器驅動(Selenium 4.6+可由內置的Selenium Manager自動下載)
# 這裏只是演示代碼結構
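對於前面提到的用戶交互觸發的內容,通常需要先在Selenium中模擬滾動或點擊,等新內容渲染出來後再提取頁面源碼。下面是一個簡要草圖(其中的URL和CSS選擇器均爲假設的佔位值,需按目標頁面調整):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
def crawl_after_interaction(url, load_more_selector, item_selector):
    """滾動到底部並點擊加載更多按鈕後再提取內容(示意,選擇器均爲假設值)"""
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # 模擬滾動到頁面底部,觸發懶加載內容
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # 等待加載更多按鈕可點擊後再點擊它
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, load_more_selector))
        )
        button.click()
        # 等待新內容節點出現,然後返回渲染後的HTML,交給BeautifulSoup等解析
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, item_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
# 示例調用(URL與選擇器僅爲佔位)
# html = crawl_after_interaction("https://example.com/list", "button.load-more", ".list-item")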
網頁編碼和字符集¶
正確處理網頁編碼是避免亂碼問題的關鍵。常見的編碼格式包括:
- UTF-8:支持全球所有字符的Unicode編碼
- GBK/GB2312:中文編碼格式
- ISO-8859-1:西歐字符編碼
- ASCII:基本英文字符編碼
import requests
from bs4 import BeautifulSoup
import chardet
def handle_encoding_issues():
"""
處理網頁編碼問題
"""
print("=== 網頁編碼處理 ===")
# 測試不同編碼的處理
test_urls = [
"https://httpbin.org/encoding/utf8",
"https://httpbin.org/html",
]
for url in test_urls:
try:
print(f"\n處理URL: {url}")
# 獲取原始響應
response = requests.get(url)
print(f"響應編碼: {response.encoding}")
print(f"表觀編碼: {response.apparent_encoding}")
# 方法1:使用chardet檢測編碼
detected_encoding = chardet.detect(response.content)
print(f"檢測到的編碼: {detected_encoding}")
# 方法2:從HTML meta標籤獲取編碼
soup = BeautifulSoup(response.content, 'html.parser')
# 查找charset聲明
charset_meta = soup.find('meta', attrs={'charset': True})
if charset_meta:
declared_charset = charset_meta.get('charset')
print(f"聲明的編碼: {declared_charset}")
else:
# 查找http-equiv類型的meta標籤
content_type_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
if content_type_meta:
content = content_type_meta.get('content', '')
if 'charset=' in content:
declared_charset = content.split('charset=')[1].split(';')[0]
print(f"聲明的編碼: {declared_charset}")
# 方法3:正確設置編碼後重新解析
if detected_encoding['encoding']:
response.encoding = detected_encoding['encoding']
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title')
if title:
print(f"正確編碼後的標題: {title.get_text().strip()}")
except Exception as e:
print(f"編碼處理異常: {e}")
def create_encoding_safe_crawler():
"""
創建編碼安全的爬蟲
"""
def safe_get_text(url, timeout=10):
"""
安全獲取網頁文本內容
"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers, timeout=timeout)
# 1. 首先嚐試使用響應頭中的編碼
if response.encoding != 'ISO-8859-1': # 避免錯誤的默認編碼
soup = BeautifulSoup(response.text, 'html.parser')
else:
# 2. 使用chardet檢測編碼
detected = chardet.detect(response.content)
if detected['confidence'] > 0.7: # 置信度閾值
response.encoding = detected['encoding']
soup = BeautifulSoup(response.text, 'html.parser')
else:
# 3. 嘗試常見編碼
for encoding in ['utf-8', 'gbk', 'gb2312']:
try:
text = response.content.decode(encoding)
soup = BeautifulSoup(text, 'html.parser')
break
except UnicodeDecodeError:
continue
else:
# 4. 使用錯誤處理策略
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
return soup
except Exception as e:
print(f"獲取頁面內容失敗: {e}")
return None
# 測試編碼安全爬蟲
test_url = "https://httpbin.org/html"
soup = safe_get_text(test_url)
if soup:
title = soup.find('title')
print(f"\n編碼安全爬蟲結果:")
print(f"標題: {title.get_text().strip() if title else '無標題'}")
# 提取文本內容
paragraphs = soup.find_all('p')
print(f"段落數量: {len(paragraphs)}")
for i, p in enumerate(paragraphs[:2], 1):
text = p.get_text().strip()
print(f"段落{i}: {text[:100]}...")
# 運行編碼處理演示
if __name__ == "__main__":
handle_encoding_issues()
create_encoding_safe_crawler()
爬蟲開發環境¶
開發工具選擇¶
選擇合適的開發工具能夠顯著提高爬蟲開發效率:
IDE和編輯器:
- PyCharm:功能強大的Python IDE,支持調試和代碼分析
- VS Code:輕量級編輯器,豐富的插件生態
- Jupyter Notebook:適合數據分析和原型開發
- Sublime Text:快速的文本編輯器
瀏覽器開發者工具:
- Chrome DevTools:分析網頁結構、網絡請求、JavaScript執行
- Firefox Developer Tools:類似Chrome,某些功能更強大
- 網絡面板:查看HTTP請求和響應(可將面板中的請求複製到Requests中重放,示例見本小節末尾)
- 元素面板:分析HTML結構和CSS樣式
抓包工具:
- Fiddler:Windows平臺的HTTP調試代理
- Charles:跨平臺的HTTP監控工具
- mitmproxy:基於Python的中間人代理
- Wireshark:網絡協議分析器
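無論使用開發者工具的網絡面板還是上述抓包工具,定位到目標請求後,常見做法是把它的URL、請求頭和參數複製出來,用Requests重放一遍,驗證脫離瀏覽器後仍能取到同樣的數據。下面是一個簡要示意(其中的URL、頭部和參數均爲假設的示例值,實際內容應從面板中複製):
import requests
# 以下URL、請求頭和參數僅爲示例,實際應從網絡面板或抓包工具中複製
url = "https://httpbin.org/get"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Referer': 'https://httpbin.org/',
}
params = {'page': 1, 'size': 20}  # 對應面板中的Query String Parameters
response = requests.get(url, headers=headers, params=params, timeout=10)
print(response.status_code)
print(response.json()['args'])  # 確認服務器收到了與瀏覽器中一致的參數
如果重放結果與瀏覽器中不一致,通常說明缺少某個關鍵頭部或Cookie,可以逐項補齊後再試。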
代理和IP池¶
使用代理服務器可以隱藏真實IP地址,避免被網站封禁:
import requests
import random
import time
from itertools import cycle
class ProxyManager:
"""
代理管理器
"""
def __init__(self):
# 代理列表(示例,實際使用時需要有效的代理)
self.proxy_list = [
{'http': 'http://proxy1:port', 'https': 'https://proxy1:port'},
{'http': 'http://proxy2:port', 'https': 'https://proxy2:port'},
{'http': 'http://proxy3:port', 'https': 'https://proxy3:port'},
]
self.proxy_cycle = cycle(self.proxy_list)
self.failed_proxies = set()
def get_proxy(self):
"""
獲取可用代理
"""
for _ in range(len(self.proxy_list)):
proxy = next(self.proxy_cycle)
proxy_key = str(proxy)
if proxy_key not in self.failed_proxies:
return proxy
# 如果所有代理都失敗,清空失敗列表重新開始
self.failed_proxies.clear()
return next(self.proxy_cycle)
def mark_proxy_failed(self, proxy):
"""
標記代理失敗
"""
self.failed_proxies.add(str(proxy))
def test_proxy(self, proxy, test_url="https://httpbin.org/ip"):
"""
測試代理是否可用
"""
try:
response = requests.get(
test_url,
proxies=proxy,
timeout=10,
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
)
if response.status_code == 200:
data = response.json()
print(f"代理測試成功,IP: {data.get('origin')}")
return True
else:
print(f"代理測試失敗,狀態碼: {response.status_code}")
return False
except Exception as e:
print(f"代理測試異常: {e}")
return False
def demonstrate_proxy_usage():
"""
演示代理使用
"""
print("=== 代理使用演示 ===")
# 不使用代理的請求
try:
response = requests.get("https://httpbin.org/ip", timeout=10)
if response.status_code == 200:
data = response.json()
print(f"直接訪問IP: {data.get('origin')}")
except Exception as e:
print(f"直接訪問失敗: {e}")
# 使用代理的請求(示例)
proxy_manager = ProxyManager()
# 注意:以下代碼需要有效的代理服務器才能正常工作
print("\n代理測試(需要有效代理):")
for i in range(3):
proxy = proxy_manager.get_proxy()
print(f"測試代理 {i+1}: {proxy}")
# 在實際環境中測試代理
# is_working = proxy_manager.test_proxy(proxy)
# if not is_working:
# proxy_manager.mark_proxy_failed(proxy)
# 免費代理獲取示例
def get_free_proxies():
"""
獲取免費代理(示例)
"""
print("\n=== 免費代理獲取 ===")
# 這裏只是演示結構,實際需要從代理網站爬取
free_proxy_sources = [
"https://www.proxy-list.download/api/v1/get?type=http",
"https://api.proxyscrape.com/v2/?request=get&protocol=http",
]
proxies = []
for source in free_proxy_sources:
try:
print(f"從 {source} 獲取代理...")
# 實際實現需要解析不同網站的格式
# response = requests.get(source, timeout=10)
# 解析代理列表...
print("代理獲取完成(示例)")
except Exception as e:
print(f"獲取代理失敗: {e}")
return proxies
# 運行代理演示
if __name__ == "__main__":
demonstrate_proxy_usage()
get_free_proxies()
用戶代理設置¶
用戶代理(User-Agent)字符串標識客戶端應用程序,設置合適的User-Agent可以避免被識別爲爬蟲:
import requests
import random
class UserAgentManager:
"""
用戶代理管理器
"""
def __init__(self):
self.user_agents = [
# Chrome
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
# Firefox
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
# Safari
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
# Edge
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
]
def get_random_user_agent(self):
"""
獲取隨機用戶代理
"""
return random.choice(self.user_agents)
def get_mobile_user_agent(self):
"""
獲取移動端用戶代理
"""
mobile_agents = [
'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
'Mozilla/5.0 (Android 14; Mobile; rv:121.0) Gecko/121.0 Firefox/121.0',
'Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
]
return random.choice(mobile_agents)
def demonstrate_user_agent():
"""
演示用戶代理的使用
"""
print("=== 用戶代理演示 ===")
ua_manager = UserAgentManager()
# 測試不同的用戶代理
test_url = "https://httpbin.org/user-agent"
for i in range(3):
user_agent = ua_manager.get_random_user_agent()
headers = {'User-Agent': user_agent}
try:
response = requests.get(test_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"\n請求 {i+1}:")
print(f"發送的User-Agent: {user_agent[:50]}...")
print(f"服務器接收到的: {data.get('user-agent', '')[:50]}...")
except Exception as e:
print(f"請求失敗: {e}")
# 測試移動端用戶代理
print("\n=== 移動端用戶代理 ===")
mobile_ua = ua_manager.get_mobile_user_agent()
headers = {'User-Agent': mobile_ua}
try:
response = requests.get(test_url, headers=headers)
if response.status_code == 200:
data = response.json()
print(f"移動端User-Agent: {data.get('user-agent')}")
except Exception as e:
print(f"移動端請求失敗: {e}")
# 運行用戶代理演示
if __name__ == "__main__":
demonstrate_user_agent()
調試和測試工具¶
有效的調試和測試工具能夠幫助快速定位和解決爬蟲開發中的問題:
import requests
import time
import logging
from functools import wraps
# 配置日誌
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('crawler.log'),
logging.StreamHandler()
]
)
def debug_request(func):
"""
請求調試裝飾器
"""
@wraps(func)
def wrapper(*args, **kwargs):
start_time = time.time()
try:
result = func(*args, **kwargs)
end_time = time.time()
logging.info(f"{func.__name__} 執行成功,耗時: {end_time - start_time:.3f}秒")
return result
except Exception as e:
end_time = time.time()
logging.error(f"{func.__name__} 執行失敗,耗時: {end_time - start_time:.3f}秒,錯誤: {e}")
raise
return wrapper
class CrawlerDebugger:
"""
爬蟲調試器
"""
def __init__(self):
self.request_count = 0
self.success_count = 0
self.error_count = 0
self.start_time = time.time()
@debug_request
def debug_get(self, url, **kwargs):
"""
調試版本的GET請求
"""
self.request_count += 1
# 默認headers
default_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
        # 先放入默認headers,再用調用方傳入的headers覆蓋,保證調用方的設置優先
        headers = {**default_headers, **kwargs.get('headers', {})}
        kwargs['headers'] = headers
logging.info(f"發送GET請求到: {url}")
logging.debug(f"請求參數: {kwargs}")
try:
response = requests.get(url, **kwargs)
logging.info(f"響應狀態碼: {response.status_code}")
logging.info(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
logging.debug(f"響應頭: {dict(response.headers)}")
if response.status_code == 200:
self.success_count += 1
else:
self.error_count += 1
logging.warning(f"非200狀態碼: {response.status_code}")
return response
except requests.RequestException as e:
self.error_count += 1
logging.error(f"請求異常: {e}")
raise
def get_stats(self):
"""
獲取統計信息
"""
elapsed_time = time.time() - self.start_time
stats = {
'總請求數': self.request_count,
'成功請求數': self.success_count,
'失敗請求數': self.error_count,
'成功率': f"{(self.success_count / max(self.request_count, 1)) * 100:.2f}%",
'運行時間': f"{elapsed_time:.2f}秒",
'平均請求速度': f"{self.request_count / max(elapsed_time, 1):.2f}請求/秒"
}
return stats
def print_stats(self):
"""
打印統計信息
"""
stats = self.get_stats()
print("\n=== 爬蟲統計信息 ===")
for key, value in stats.items():
print(f"{key}: {value}")
def test_crawler_debugger():
"""
測試爬蟲調試器
"""
debugger = CrawlerDebugger()
test_urls = [
"https://httpbin.org/get",
"https://httpbin.org/status/200",
"https://httpbin.org/delay/1",
"https://httpbin.org/status/404", # 這個會返回404
"https://httpbin.org/json",
]
print("開始測試爬蟲調試器...")
for url in test_urls:
try:
response = debugger.debug_get(url, timeout=10)
print(f"✓ {url} - 狀態碼: {response.status_code}")
except Exception as e:
print(f"✗ {url} - 錯誤: {e}")
time.sleep(0.5) # 避免請求過快
# 打印統計信息
debugger.print_stats()
# 性能測試工具
def performance_test(func, *args, **kwargs):
"""
性能測試裝飾器
"""
def test_performance(iterations=10):
times = []
for i in range(iterations):
start_time = time.time()
try:
func(*args, **kwargs)
end_time = time.time()
times.append(end_time - start_time)
except Exception as e:
print(f"第{i+1}次測試失敗: {e}")
if times:
avg_time = sum(times) / len(times)
min_time = min(times)
max_time = max(times)
print(f"\n=== 性能測試結果 ({iterations}次) ===")
print(f"平均時間: {avg_time:.3f}秒")
print(f"最短時間: {min_time:.3f}秒")
print(f"最長時間: {max_time:.3f}秒")
print(f"成功率: {len(times)}/{iterations} ({len(times)/iterations*100:.1f}%)")
return test_performance
# 運行調試演示
if __name__ == "__main__":
test_crawler_debugger()
# 性能測試示例
@performance_test
def simple_request():
response = requests.get("https://httpbin.org/get", timeout=5)
return response.status_code == 200
print("\n開始性能測試...")
simple_request(iterations=5)
運行結果示例:
開始測試爬蟲調試器...
2024-01-15 14:30:15,123 - INFO - 發送GET請求到: https://httpbin.org/get
2024-01-15 14:30:15,456 - INFO - 響應狀態碼: 200
2024-01-15 14:30:15,456 - INFO - 響應時間: 0.333秒
2024-01-15 14:30:15,456 - INFO - debug_get 執行成功,耗時: 0.334秒
✓ https://httpbin.org/get - 狀態碼: 200
2024-01-15 14:30:16,001 - INFO - 發送GET請求到: https://httpbin.org/status/200
2024-01-15 14:30:16,234 - INFO - 響應狀態碼: 200
2024-01-15 14:30:16,234 - INFO - 響應時間: 0.233秒
2024-01-15 14:30:16,234 - INFO - debug_get 執行成功,耗時: 0.234秒
✓ https://httpbin.org/status/200 - 狀態碼: 200
=== 爬蟲統計信息 ===
總請求數: 5
成功請求數: 4
失敗請求數: 1
成功率: 80.00%
運行時間: 3.45秒
平均請求速度: 1.45請求/秒
=== 性能測試結果 (5次) ===
平均時間: 0.456秒
最短時間: 0.234秒
最長時間: 0.678秒
成功率: 5/5 (100.0%)
14.2 Requests庫網絡請求¶
Requests是Python中最受歡迎的HTTP庫,它讓HTTP請求變得簡單而優雅。相比於Python標準庫中的urllib,Requests提供了更加人性化的API,是網絡爬蟲開發的首選工具。
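爲了直觀感受這種差異,下面用標準庫urllib和Requests分別發送同一個帶參數的GET請求作對比(僅作示意):
import json
import urllib.parse
import urllib.request
# 用標準庫urllib發送帶參數的GET請求
params = urllib.parse.urlencode({'name': '張三', 'age': 25})
req = urllib.request.Request(
    f"https://httpbin.org/get?{params}",
    headers={'User-Agent': 'Mozilla/5.0'}
)
with urllib.request.urlopen(req, timeout=10) as resp:
    data = json.loads(resp.read().decode('utf-8'))
    print(data['args'])
# 等價的Requests寫法:參數編碼、解碼和JSON解析都由庫自動完成
import requests
response = requests.get(
    'https://httpbin.org/get',
    params={'name': '張三', 'age': 25},
    headers={'User-Agent': 'Mozilla/5.0'},
    timeout=10
)
print(response.json()['args'])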
Requests基礎¶
安裝和基本使用¶
Requests庫的安裝非常簡單,使用pip命令即可:
pip install requests
安裝完成後,我們來看看Requests的基本使用方法:
import requests
import json
from pprint import pprint
def basic_requests_usage():
"""
演示Requests的基本使用方法
"""
print("=== Requests基礎使用演示 ===")
# 1. 最簡單的GET請求
print("\n1. 基本GET請求:")
response = requests.get('https://httpbin.org/get')
print(f"狀態碼: {response.status_code}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print(f"內容類型: {response.headers.get('content-type')}")
# 2. 檢查請求是否成功
if response.status_code == 200:
print("請求成功!")
data = response.json() # 解析JSON響應
print(f"服務器接收到的URL: {data['url']}")
else:
print(f"請求失敗,狀態碼: {response.status_code}")
# 3. 使用raise_for_status()檢查狀態
try:
        response.raise_for_status() # 狀態碼爲4xx或5xx時會拋出HTTPError異常
print("狀態檢查通過")
except requests.exceptions.HTTPError as e:
print(f"HTTP錯誤: {e}")
# 4. 獲取響應內容的不同方式
print("\n2. 響應內容獲取:")
# 文本內容
print(f"響應文本長度: {len(response.text)}字符")
# 二進制內容
print(f"響應二進制長度: {len(response.content)}字節")
# JSON內容(如果是JSON格式)
try:
json_data = response.json()
print(f"JSON數據鍵: {list(json_data.keys())}")
except ValueError:
print("響應不是有效的JSON格式")
# 5. 響應頭信息
print("\n3. 響應頭信息:")
print(f"服務器: {response.headers.get('server', '未知')}")
print(f"內容長度: {response.headers.get('content-length', '未知')}")
print(f"連接類型: {response.headers.get('connection', '未知')}")
# 運行基礎演示
if __name__ == "__main__":
basic_requests_usage()
運行結果:
=== Requests基礎使用演示 ===
1. 基本GET請求:
狀態碼: 200
響應時間: 0.234秒
內容類型: application/json
請求成功!
服務器接收到的URL: https://httpbin.org/get
狀態檢查通過
2. 響應內容獲取:
響應文本長度: 312字符
響應二進制長度: 312字節
JSON數據鍵: ['args', 'headers', 'origin', 'url']
3. 響應頭信息:
服務器: gunicorn/19.9.0
內容長度: 312
連接類型: keep-alive
GET和POST請求¶
GET和POST是HTTP協議中最常用的兩種請求方法。GET用於獲取數據,POST用於提交數據。
GET請求詳解:
import requests
from urllib.parse import urlencode
def demonstrate_get_requests():
"""
演示各種GET請求的使用方法
"""
print("=== GET請求詳解 ===")
# 1. 基本GET請求
print("\n1. 基本GET請求:")
response = requests.get('https://httpbin.org/get')
print(f"請求URL: {response.url}")
print(f"狀態碼: {response.status_code}")
# 2. 帶參數的GET請求
print("\n2. 帶參數的GET請求:")
# 方法1: 使用params參數
params = {
'name': '張三',
'age': 25,
'city': '北京',
'hobbies': ['讀書', '游泳'] # 列表參數
}
response = requests.get('https://httpbin.org/get', params=params)
print(f"構建的URL: {response.url}")
data = response.json()
print(f"服務器接收到的參數: {data['args']}")
# 方法2: 直接在URL中包含參數
url_with_params = 'https://httpbin.org/get?name=李四&age=30'
response2 = requests.get(url_with_params)
print(f"\n直接URL參數: {response2.json()['args']}")
# 3. 自定義請求頭
print("\n3. 自定義請求頭:")
headers = {
'User-Agent': 'MySpider/1.0',
'Accept': 'application/json',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Referer': 'https://www.example.com'
}
response = requests.get('https://httpbin.org/get', headers=headers)
received_headers = response.json()['headers']
print(f"發送的User-Agent: {headers['User-Agent']}")
print(f"服務器接收到的User-Agent: {received_headers.get('User-Agent')}")
# 4. 超時設置
print("\n4. 超時設置:")
try:
# 設置連接超時爲3秒,讀取超時爲5秒
response = requests.get('https://httpbin.org/delay/2', timeout=(3, 5))
print(f"請求成功,耗時: {response.elapsed.total_seconds():.3f}秒")
except requests.exceptions.Timeout:
print("請求超時")
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
# 5. 處理重定向
print("\n5. 重定向處理:")
# 允許重定向(默認行爲)
response = requests.get('https://httpbin.org/redirect/2')
print(f"最終URL: {response.url}")
print(f"重定向歷史: {[r.url for r in response.history]}")
# 禁止重定向
response_no_redirect = requests.get('https://httpbin.org/redirect/1', allow_redirects=False)
print(f"\n禁止重定向狀態碼: {response_no_redirect.status_code}")
print(f"Location頭: {response_no_redirect.headers.get('Location')}")
# 運行GET請求演示
if __name__ == "__main__":
demonstrate_get_requests()
POST請求詳解:
import requests
import json
def demonstrate_post_requests():
"""
演示各種POST請求的使用方法
"""
print("=== POST請求詳解 ===")
# 1. 發送表單數據
print("\n1. 發送表單數據:")
form_data = {
'username': 'testuser',
'password': 'testpass',
'email': 'test@example.com',
'remember': 'on'
}
response = requests.post('https://httpbin.org/post', data=form_data)
if response.status_code == 200:
result = response.json()
print(f"發送的表單數據: {form_data}")
print(f"服務器接收到的表單: {result['form']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
# 2. 發送JSON數據
print("\n2. 發送JSON數據:")
json_data = {
'name': '王五',
'age': 28,
'skills': ['Python', 'JavaScript', 'SQL'],
'is_active': True,
'profile': {
'city': '上海',
'experience': 5
}
}
# 方法1: 使用json參數(推薦)
response = requests.post('https://httpbin.org/post', json=json_data)
if response.status_code == 200:
result = response.json()
print(f"發送的JSON數據: {json_data}")
print(f"服務器接收到的JSON: {result['json']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
# 方法2: 手動設置headers和data
headers = {'Content-Type': 'application/json'}
response2 = requests.post(
'https://httpbin.org/post',
data=json.dumps(json_data),
headers=headers
)
print(f"\n手動設置方式狀態碼: {response2.status_code}")
# 3. 發送文件
print("\n3. 文件上傳:")
# 創建一個臨時文件用於演示
import tempfile
import os
# 創建臨時文件
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
f.write("這是一個測試文件\n包含中文內容")
temp_file_path = f.name
try:
# 上傳文件
with open(temp_file_path, 'rb') as f:
files = {'file': ('test.txt', f, 'text/plain')}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"上傳的文件信息: {result['files']}")
print(f"Content-Type: {result['headers'].get('Content-Type')}")
finally:
# 清理臨時文件
os.unlink(temp_file_path)
# 4. 混合數據提交
print("\n4. 混合數據提交:")
# 同時發送表單數據和文件
form_data = {'description': '文件描述', 'category': 'test'}
# 創建內存中的文件對象
from io import StringIO, BytesIO
file_content = BytesIO(b"Hello, World! This is a test file.")
files = {'upload': ('hello.txt', file_content, 'text/plain')}
response = requests.post(
'https://httpbin.org/post',
data=form_data,
files=files
)
if response.status_code == 200:
result = response.json()
print(f"表單數據: {result['form']}")
print(f"文件數據: {list(result['files'].keys())}")
# 5. 自定義請求頭的POST
print("\n5. 自定義請求頭的POST:")
headers = {
'User-Agent': 'MyApp/2.0',
'Authorization': 'Bearer your-token-here',
'X-Custom-Header': 'custom-value'
}
data = {'message': 'Hello from custom headers'}
response = requests.post(
'https://httpbin.org/post',
json=data,
headers=headers
)
if response.status_code == 200:
result = response.json()
received_headers = result['headers']
print(f"自定義頭部 X-Custom-Header: {received_headers.get('X-Custom-Header')}")
print(f"Authorization: {received_headers.get('Authorization')}")
# 運行POST請求演示
if __name__ == "__main__":
demonstrate_post_requests()
運行結果示例:
=== POST請求詳解 ===
1. 發送表單數據:
發送的表單數據: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
服務器接收到的表單: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
Content-Type: application/x-www-form-urlencoded
2. 發送JSON數據:
發送的JSON數據: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
服務器接收到的JSON: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
Content-Type: application/json
3. 文件上傳:
上傳的文件信息: {'file': '這是一個測試文件\n包含中文內容'}
Content-Type: multipart/form-data; boundary=...
4. 混合數據提交:
表單數據: {'description': '文件描述', 'category': 'test'}
文件數據: ['upload']
5. 自定義請求頭的POST:
自定義頭部 X-Custom-Header: custom-value
Authorization: Bearer your-token-here
請求參數和頭部¶
在網絡爬蟲中,正確設置請求參數和頭部信息是非常重要的,它們決定了服務器如何處理我們的請求。
請求參數詳解¶
import requests
from urllib.parse import urlencode, quote
def advanced_parameters_demo():
"""
演示高級參數處理
"""
print("=== 高級參數處理演示 ===")
# 1. 複雜參數結構
print("\n1. 複雜參數結構:")
complex_params = {
'q': 'Python爬蟲', # 中文搜索詞
'page': 1,
'size': 20,
'sort': ['time', 'relevance'], # 多值參數
'filters': {
'category': 'tech',
'date_range': '2024-01-01,2024-12-31'
},
'include_fields': ['title', 'content', 'author'],
'exclude_empty': True
}
# Requests會自動處理複雜參數
response = requests.get('https://httpbin.org/get', params=complex_params)
print(f"構建的URL: {response.url}")
result = response.json()
print(f"\n服務器接收到的參數:")
for key, value in result['args'].items():
print(f" {key}: {value}")
# 2. 手動URL編碼
print("\n2. 手動URL編碼:")
# 處理特殊字符
special_params = {
'query': 'hello world & python',
'symbols': '!@#$%^&*()+={}[]|\\:;"<>?,./'
}
# 方法1: 使用requests自動編碼
response1 = requests.get('https://httpbin.org/get', params=special_params)
print(f"自動編碼URL: {response1.url}")
# 方法2: 手動編碼
encoded_query = quote('hello world & python')
manual_url = f'https://httpbin.org/get?query={encoded_query}'
response2 = requests.get(manual_url)
print(f"手動編碼URL: {response2.url}")
# 3. 數組參數的不同處理方式
print("\n3. 數組參數處理:")
# 方式1: Python列表(默認行爲)
list_params = {'tags': ['python', 'web', 'crawler']}
response = requests.get('https://httpbin.org/get', params=list_params)
print(f"列表參數URL: {response.url}")
# 方式2: 手動構建重複參數
manual_params = [('tags', 'python'), ('tags', 'web'), ('tags', 'crawler')]
response2 = requests.get('https://httpbin.org/get', params=manual_params)
print(f"手動重複參數URL: {response2.url}")
# 4. 條件參數構建
print("\n4. 條件參數構建:")
def build_search_params(keyword, page=1, filters=None, sort_by=None):
"""
根據條件構建搜索參數
"""
params = {'q': keyword, 'page': page}
if filters:
for key, value in filters.items():
if value: # 只添加非空值
params[f'filter_{key}'] = value
if sort_by:
params['sort'] = sort_by
return params
# 使用條件參數構建
search_filters = {
'category': 'technology',
'author': '', # 空值,不會被添加
'date': '2024-01-01'
}
params = build_search_params(
keyword='Python教程',
page=2,
filters=search_filters,
sort_by='date_desc'
)
response = requests.get('https://httpbin.org/get', params=params)
print(f"條件構建的參數: {response.json()['args']}")
# 運行參數演示
if __name__ == "__main__":
advanced_parameters_demo()
請求頭部詳解¶
import requests
import time
import random
def advanced_headers_demo():
"""
演示高級請求頭處理
"""
print("=== 高級請求頭演示 ===")
# 1. 完整的瀏覽器請求頭模擬
print("\n1. 完整瀏覽器頭部模擬:")
browser_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1', # Do Not Track
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Cache-Control': 'max-age=0'
}
response = requests.get('https://httpbin.org/get', headers=browser_headers)
received_headers = response.json()['headers']
print(f"發送的User-Agent: {browser_headers['User-Agent'][:50]}...")
print(f"服務器接收的User-Agent: {received_headers.get('User-Agent', '')[:50]}...")
print(f"Accept-Language: {received_headers.get('Accept-Language')}")
# 2. API請求頭
print("\n2. API請求頭:")
api_headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
'X-API-Key': 'your-api-key-here',
'X-Client-Version': '1.2.3',
'X-Request-ID': f'req_{int(time.time())}_{random.randint(1000, 9999)}'
}
data = {'query': 'test data'}
response = requests.post('https://httpbin.org/post', json=data, headers=api_headers)
if response.status_code == 200:
result = response.json()
print(f"API請求成功")
print(f"Request ID: {result['headers'].get('X-Request-ID')}")
print(f"Authorization: {result['headers'].get('Authorization', '')[:20]}...")
# 3. 防爬蟲頭部設置
print("\n3. 防爬蟲頭部設置:")
# 模擬真實瀏覽器行爲
anti_bot_headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Referer': 'https://www.google.com/', # 模擬從搜索引擎來
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
response = requests.get('https://httpbin.org/get', headers=anti_bot_headers)
print(f"防爬蟲請求狀態: {response.status_code}")
print(f"Referer頭: {response.json()['headers'].get('Referer')}")
# 4. 動態頭部生成
print("\n4. 動態頭部生成:")
def generate_dynamic_headers():
"""
生成動態請求頭
"""
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
]
referers = [
'https://www.google.com/',
'https://www.bing.com/',
'https://www.baidu.com/',
'https://duckduckgo.com/'
]
return {
'User-Agent': random.choice(user_agents),
'Referer': random.choice(referers),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'X-Forwarded-For': f'{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}'
}
# 使用動態頭部發送多個請求
for i in range(3):
headers = generate_dynamic_headers()
response = requests.get('https://httpbin.org/get', headers=headers)
if response.status_code == 200:
result = response.json()
print(f"\n請求 {i+1}:")
print(f" User-Agent: {result['headers'].get('User-Agent', '')[:40]}...")
print(f" Referer: {result['headers'].get('Referer')}")
print(f" X-Forwarded-For: {result['headers'].get('X-Forwarded-For')}")
# 5. 頭部優先級和覆蓋
print("\n5. 頭部優先級演示:")
# 創建會話並設置默認頭部
session = requests.Session()
session.headers.update({
'User-Agent': 'DefaultAgent/1.0',
'Accept': 'application/json',
'X-Default-Header': 'default-value'
})
# 請求時覆蓋部分頭部
override_headers = {
'User-Agent': 'OverrideAgent/2.0', # 覆蓋默認值
'X-Custom-Header': 'custom-value' # 新增頭部
}
response = session.get('https://httpbin.org/get', headers=override_headers)
if response.status_code == 200:
result = response.json()
headers = result['headers']
print(f"最終User-Agent: {headers.get('User-Agent')}")
print(f"默認Accept: {headers.get('Accept')}")
print(f"默認頭部: {headers.get('X-Default-Header')}")
print(f"自定義頭部: {headers.get('X-Custom-Header')}")
# 運行頭部演示
if __name__ == "__main__":
advanced_headers_demo()
響應對象處理¶
響應對象包含了服務器返回的所有信息,正確處理響應對象是爬蟲開發的關鍵技能。
import requests
import json
from datetime import datetime
def response_handling_demo():
"""
演示響應對象的各種處理方法
"""
print("=== 響應對象處理演示 ===")
# 發送一個測試請求
response = requests.get('https://httpbin.org/json')
# 1. 基本響應信息
print("\n1. 基本響應信息:")
print(f"狀態碼: {response.status_code}")
print(f"狀態描述: {response.reason}")
print(f"請求URL: {response.url}")
print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
print(f"編碼: {response.encoding}")
# 2. 響應頭詳細分析
print("\n2. 響應頭分析:")
print(f"Content-Type: {response.headers.get('content-type')}")
print(f"Content-Length: {response.headers.get('content-length')}")
print(f"Server: {response.headers.get('server')}")
print(f"Date: {response.headers.get('date')}")
# 檢查是否支持壓縮
content_encoding = response.headers.get('content-encoding')
if content_encoding:
print(f"內容編碼: {content_encoding}")
else:
print("未使用內容壓縮")
# 3. 響應內容的不同獲取方式
print("\n3. 響應內容獲取:")
# 文本內容
text_content = response.text
print(f"文本內容長度: {len(text_content)}字符")
print(f"文本內容預覽: {text_content[:100]}...")
# 二進制內容
binary_content = response.content
print(f"二進制內容長度: {len(binary_content)}字節")
# JSON內容
try:
json_content = response.json()
print(f"JSON內容類型: {type(json_content)}")
if isinstance(json_content, dict):
print(f"JSON鍵: {list(json_content.keys())}")
except ValueError as e:
print(f"JSON解析失敗: {e}")
# 4. 響應狀態檢查
print("\n4. 響應狀態檢查:")
def check_response_status(response):
"""
檢查響應狀態的詳細信息
"""
print(f"狀態碼: {response.status_code}")
# 使用內置方法檢查狀態
if response.ok:
print("✓ 請求成功 (狀態碼 200-299)")
else:
print("✗ 請求失敗")
# 詳細狀態分類
if 200 <= response.status_code < 300:
print("✓ 成功響應")
elif 300 <= response.status_code < 400:
print("→ 重定向響應")
location = response.headers.get('location')
if location:
print(f" 重定向到: {location}")
elif 400 <= response.status_code < 500:
print("✗ 客戶端錯誤")
elif 500 <= response.status_code < 600:
print("✗ 服務器錯誤")
# 使用raise_for_status檢查
try:
response.raise_for_status()
print("✓ 狀態檢查通過")
except requests.exceptions.HTTPError as e:
print(f"✗ 狀態檢查失敗: {e}")
check_response_status(response)
# 5. 測試不同狀態碼的響應
print("\n5. 不同狀態碼測試:")
test_urls = [
('https://httpbin.org/status/200', '成功'),
('https://httpbin.org/status/404', '未找到'),
('https://httpbin.org/status/500', '服務器錯誤'),
('https://httpbin.org/redirect/1', '重定向')
]
for url, description in test_urls:
try:
resp = requests.get(url, timeout=5)
print(f"\n{description} ({url}):")
print(f" 狀態碼: {resp.status_code}")
print(f" 最終URL: {resp.url}")
if resp.history:
print(f" 重定向歷史: {[r.status_code for r in resp.history]}")
except requests.exceptions.RequestException as e:
print(f"\n{description} 請求失敗: {e}")
# 6. 響應內容類型處理
print("\n6. 不同內容類型處理:")
def handle_different_content_types():
"""
處理不同類型的響應內容
"""
# JSON響應
json_resp = requests.get('https://httpbin.org/json')
if json_resp.headers.get('content-type', '').startswith('application/json'):
data = json_resp.json()
print(f"JSON數據: {data}")
# HTML響應
html_resp = requests.get('https://httpbin.org/html')
if 'text/html' in html_resp.headers.get('content-type', ''):
print(f"HTML內容長度: {len(html_resp.text)}字符")
# 可以使用BeautifulSoup進一步解析
# XML響應
xml_resp = requests.get('https://httpbin.org/xml')
if 'application/xml' in xml_resp.headers.get('content-type', ''):
print(f"XML內容長度: {len(xml_resp.text)}字符")
# 圖片響應(二進制)
try:
img_resp = requests.get('https://httpbin.org/image/png', timeout=10)
if img_resp.headers.get('content-type', '').startswith('image/'):
print(f"圖片大小: {len(img_resp.content)}字節")
print(f"圖片類型: {img_resp.headers.get('content-type')}")
except requests.exceptions.RequestException:
print("圖片請求失敗或超時")
handle_different_content_types()
# 7. 響應時間和性能分析
print("\n7. 響應時間分析:")
def analyze_response_performance(url, num_requests=3):
"""
分析響應性能
"""
times = []
for i in range(num_requests):
start_time = datetime.now()
try:
resp = requests.get(url, timeout=10)
end_time = datetime.now()
# 計算總時間
total_time = (end_time - start_time).total_seconds()
# 獲取requests內部計時
elapsed_time = resp.elapsed.total_seconds()
times.append({
'total': total_time,
'elapsed': elapsed_time,
'status': resp.status_code
})
print(f"請求 {i+1}: {elapsed_time:.3f}秒 (狀態碼: {resp.status_code})")
except requests.exceptions.RequestException as e:
print(f"請求 {i+1} 失敗: {e}")
if times:
avg_time = sum(t['elapsed'] for t in times) / len(times)
min_time = min(t['elapsed'] for t in times)
max_time = max(t['elapsed'] for t in times)
print(f"\n性能統計:")
print(f" 平均響應時間: {avg_time:.3f}秒")
print(f" 最快響應時間: {min_time:.3f}秒")
print(f" 最慢響應時間: {max_time:.3f}秒")
analyze_response_performance('https://httpbin.org/delay/1')
# 運行響應處理演示
if __name__ == "__main__":
response_handling_demo()
運行結果示例:
=== 響應對象處理演示 ===
1. 基本響應信息:
狀態碼: 200
狀態描述: OK
請求URL: https://httpbin.org/json
響應時間: 0.234秒
編碼: utf-8
2. 響應頭分析:
Content-Type: application/json
Content-Length: 429
Server: gunicorn/19.9.0
Date: Mon, 15 Jan 2024 06:30:15 GMT
未使用內容壓縮
3. 響應內容獲取:
文本內容長度: 429字符
文本內容預覽: {"slideshow": {"author": "Yours Truly", "date": "date of publication", "slides": [{"title": "Wake up to WonderWidgets!", "type": "all"}, {"title": "Overview", "type": "all", "items": ["Why <em>WonderWidgets</em> are great", "Who <em>buys</em> them"]}], "title": "Sample Slide Show"}}...
二進制內容長度: 429字節
JSON內容類型: <class 'dict'>
JSON鍵: ['slideshow']
4. 響應狀態檢查:
狀態碼: 200
✓ 請求成功 (狀態碼 200-299)
✓ 成功響應
✓ 狀態檢查通過
5. 不同狀態碼測試:
成功 (https://httpbin.org/status/200):
狀態碼: 200
最終URL: https://httpbin.org/status/200
未找到 (https://httpbin.org/status/404):
狀態碼: 404
最終URL: https://httpbin.org/status/404
服務器錯誤 (https://httpbin.org/status/500):
狀態碼: 500
最終URL: https://httpbin.org/status/500
重定向 (https://httpbin.org/redirect/1):
狀態碼: 200
最終URL: https://httpbin.org/get
重定向歷史: [302]
7. 響應時間分析:
請求 1: 1.234秒 (狀態碼: 200)
請求 2: 1.156秒 (狀態碼: 200)
請求 3: 1.298秒 (狀態碼: 200)
性能統計:
平均響應時間: 1.229秒
最快響應時間: 1.156秒
最慢響應時間: 1.298秒
高級功能¶
Session會話管理¶
Session對象允許你跨請求保持某些參數:同一個Session實例發出的所有請求會自動共享Cookie,並複用urllib3的連接池。因此向同一主機發送多個請求時,底層的TCP連接會被重用,從而帶來顯著的性能提升。
import requests
import time
from datetime import datetime
def session_management_demo():
"""
演示Session會話管理的各種功能
"""
print("=== Session會話管理演示 ===")
# 1. 基本Session使用
print("\n1. 基本Session使用:")
# 創建Session對象
session = requests.Session()
# 設置Session級別的請求頭
session.headers.update({
'User-Agent': 'MyApp/1.0',
'Accept': 'application/json'
})
# 使用Session發送請求
response1 = session.get('https://httpbin.org/get')
print(f"第一次請求狀態碼: {response1.status_code}")
print(f"User-Agent: {response1.json()['headers'].get('User-Agent')}")
# Session會保持設置的頭部
response2 = session.get('https://httpbin.org/headers')
print(f"第二次請求User-Agent: {response2.json()['headers'].get('User-Agent')}")
# 2. Cookie持久化
print("\n2. Cookie持久化演示:")
# 創建新的Session
cookie_session = requests.Session()
# 第一次請求設置cookie
response = cookie_session.get('https://httpbin.org/cookies/set/session_id/abc123')
print(f"設置Cookie後的狀態碼: {response.status_code}")
# 查看Session中的cookies
print(f"Session中的Cookies: {dict(cookie_session.cookies)}")
# 第二次請求會自動攜帶cookie
response = cookie_session.get('https://httpbin.org/cookies')
cookies_data = response.json()
print(f"服務器接收到的Cookies: {cookies_data.get('cookies', {})}")
# 3. 連接池和性能優化
print("\n3. 連接池性能對比:")
def test_without_session(num_requests=5):
"""不使用Session的請求"""
start_time = time.time()
for i in range(num_requests):
response = requests.get('https://httpbin.org/get')
if response.status_code != 200:
print(f"請求 {i+1} 失敗")
end_time = time.time()
return end_time - start_time
def test_with_session(num_requests=5):
"""使用Session的請求"""
start_time = time.time()
session = requests.Session()
for i in range(num_requests):
response = session.get('https://httpbin.org/get')
if response.status_code != 200:
print(f"請求 {i+1} 失敗")
session.close()
end_time = time.time()
return end_time - start_time
print("\n性能測試 (5次請求):")
time_without_session = test_without_session()
time_with_session = test_with_session()
print(f"不使用Session: {time_without_session:.3f}秒")
print(f"使用Session: {time_with_session:.3f}秒")
print(f"性能提升: {((time_without_session - time_with_session) / time_without_session * 100):.1f}%")
# 4. Session配置和自定義
print("\n4. Session配置:")
# 創建自定義配置的Session
custom_session = requests.Session()
    # 注意:requests的Session沒有全局超時設置,給Session對象賦值timeout屬性並不會生效,
    # 超時需要在每次請求時通過timeout參數傳入(或自行封裝請求方法/定製HTTPAdapter統一處理)
# 設置默認參數
custom_session.params = {'api_key': 'your-api-key'}
# 設置默認頭部
custom_session.headers.update({
'User-Agent': 'CustomBot/2.0',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Connection': 'keep-alive'
})
# 發送請求
    response = custom_session.get('https://httpbin.org/get', params={'extra': 'param'}, timeout=10)
if response.status_code == 200:
data = response.json()
print(f"最終URL: {response.url}")
print(f"合併後的參數: {data.get('args', {})}")
print(f"請求頭: {data.get('headers', {}).get('User-Agent')}")
# 5. Session的請求鉤子
print("\n5. 請求鉤子演示:")
def log_request_hook(response, *args, **kwargs):
"""請求日誌鉤子"""
print(f"[鉤子] 請求: {response.request.method} {response.url}")
print(f"[鉤子] 狀態碼: {response.status_code}")
print(f"[鉤子] 響應時間: {response.elapsed.total_seconds():.3f}秒")
# 創建帶鉤子的Session
hook_session = requests.Session()
hook_session.hooks['response'].append(log_request_hook)
# 發送請求,鉤子會自動執行
print("\n發送帶鉤子的請求:")
response = hook_session.get('https://httpbin.org/delay/1')
# 6. Session上下文管理
print("\n6. Session上下文管理:")
# 使用with語句自動管理Session生命週期
with requests.Session() as s:
s.headers.update({'User-Agent': 'ContextManager/1.0'})
response = s.get('https://httpbin.org/get')
print(f"上下文管理器請求狀態: {response.status_code}")
print(f"User-Agent: {response.json()['headers'].get('User-Agent')}")
# Session會自動關閉
# 7. Session錯誤處理
print("\n7. Session錯誤處理:")
error_session = requests.Session()
# 設置重試適配器
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry_strategy = Retry(
total=3, # 總重試次數
backoff_factor=1, # 重試間隔
status_forcelist=[429, 500, 502, 503, 504], # 需要重試的狀態碼
)
adapter = HTTPAdapter(max_retries=retry_strategy)
error_session.mount("http://", adapter)
error_session.mount("https://", adapter)
try:
# 測試重試機制
response = error_session.get('https://httpbin.org/status/500', timeout=5)
print(f"重試後狀態碼: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"請求最終失敗: {e}")
# 8. Session狀態管理
print("\n8. Session狀態管理:")
state_session = requests.Session()
# 模擬登錄流程
login_data = {
'username': 'testuser',
'password': 'testpass'
}
# 第一步:獲取登錄頁面(可能包含CSRF token)
login_page = state_session.get('https://httpbin.org/get')
print(f"獲取登錄頁面: {login_page.status_code}")
# 第二步:提交登錄信息
login_response = state_session.post('https://httpbin.org/post', data=login_data)
print(f"登錄請求: {login_response.status_code}")
# 第三步:訪問需要認證的頁面
protected_response = state_session.get('https://httpbin.org/get')
print(f"訪問受保護頁面: {protected_response.status_code}")
# Session會自動維護整個會話狀態
print(f"會話中的Cookie數量: {len(state_session.cookies)}")
# 運行Session演示
if __name__ == "__main__":
session_management_demo()
身份驗證¶
Requests支持多種身份驗證方式,包括基本認證、摘要認證、OAuth等。
import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
import base64
import hashlib
import time
def authentication_demo():
"""
演示各種身份驗證方式
"""
print("=== 身份驗證演示 ===")
# 1. HTTP基本認證 (Basic Authentication)
print("\n1. HTTP基本認證:")
# 方法1: 使用auth參數
response = requests.get(
'https://httpbin.org/basic-auth/user/pass',
auth=('user', 'pass')
)
print(f"基本認證狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"認證成功: {response.json()}")
# 方法2: 使用HTTPBasicAuth類
response2 = requests.get(
'https://httpbin.org/basic-auth/testuser/testpass',
auth=HTTPBasicAuth('testuser', 'testpass')
)
print(f"HTTPBasicAuth狀態碼: {response2.status_code}")
# 方法3: 手動設置Authorization頭
credentials = base64.b64encode(b'user:pass').decode('ascii')
headers = {'Authorization': f'Basic {credentials}'}
response3 = requests.get(
'https://httpbin.org/basic-auth/user/pass',
headers=headers
)
print(f"手動設置頭部狀態碼: {response3.status_code}")
# 2. HTTP摘要認證 (Digest Authentication)
print("\n2. HTTP摘要認證:")
try:
response = requests.get(
'https://httpbin.org/digest-auth/auth/user/pass',
auth=HTTPDigestAuth('user', 'pass')
)
print(f"摘要認證狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"摘要認證成功: {response.json()}")
except Exception as e:
print(f"摘要認證失敗: {e}")
# 3. Bearer Token認證
print("\n3. Bearer Token認證:")
# 模擬JWT token
token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
headers = {'Authorization': f'Bearer {token}'}
response = requests.get('https://httpbin.org/bearer', headers=headers)
print(f"Bearer Token狀態碼: {response.status_code}")
if response.status_code == 200:
print(f"Token認證成功: {response.json()}")
# 4. API Key認證
print("\n4. API Key認證:")
# 方法1: 在URL參數中
api_key = "your-api-key-here"
response = requests.get(
'https://httpbin.org/get',
params={'api_key': api_key}
)
print(f"URL參數API Key: {response.json()['args']}")
# 方法2: 在請求頭中
headers = {'X-API-Key': api_key}
response2 = requests.get('https://httpbin.org/get', headers=headers)
print(f"請求頭API Key: {response2.json()['headers'].get('X-Api-Key')}")
# 5. 自定義認證類
print("\n5. 自定義認證類:")
class CustomAuth(requests.auth.AuthBase):
"""自定義認證類"""
def __init__(self, api_key, secret_key):
self.api_key = api_key
self.secret_key = secret_key
def __call__(self, r):
# 生成時間戳
timestamp = str(int(time.time()))
# 生成簽名
string_to_sign = f"{r.method}\n{r.url}\n{timestamp}"
signature = hashlib.sha256(
(string_to_sign + self.secret_key).encode('utf-8')
).hexdigest()
# 添加認證頭
r.headers['X-API-Key'] = self.api_key
r.headers['X-Timestamp'] = timestamp
r.headers['X-Signature'] = signature
return r
# 使用自定義認證
custom_auth = CustomAuth('my-api-key', 'my-secret-key')
response = requests.get('https://httpbin.org/get', auth=custom_auth)
if response.status_code == 200:
headers = response.json()['headers']
print(f"自定義認證頭部:")
print(f" X-API-Key: {headers.get('X-Api-Key')}")
print(f" X-Timestamp: {headers.get('X-Timestamp')}")
print(f" X-Signature: {headers.get('X-Signature', '')[:20]}...")
# 6. OAuth 2.0 模擬
print("\n6. OAuth 2.0 模擬:")
def oauth2_flow_simulation():
"""模擬OAuth 2.0授權流程"""
# 第一步: 獲取授權碼 (實際應用中用戶會被重定向到授權服務器)
auth_url = "https://httpbin.org/get"
auth_params = {
'response_type': 'code',
'client_id': 'your-client-id',
'redirect_uri': 'https://yourapp.com/callback',
'scope': 'read write',
'state': 'random-state-string'
}
print(f"授權URL: {auth_url}?{'&'.join([f'{k}={v}' for k, v in auth_params.items()])}")
# 第二步: 使用授權碼獲取訪問令牌
token_data = {
'grant_type': 'authorization_code',
'code': 'received-auth-code',
'redirect_uri': 'https://yourapp.com/callback',
'client_id': 'your-client-id',
'client_secret': 'your-client-secret'
}
# 模擬獲取token
token_response = requests.post('https://httpbin.org/post', data=token_data)
print(f"Token請求狀態: {token_response.status_code}")
# 第三步: 使用訪問令牌訪問API
access_token = "mock-access-token-12345"
api_headers = {'Authorization': f'Bearer {access_token}'}
api_response = requests.get('https://httpbin.org/get', headers=api_headers)
print(f"API訪問狀態: {api_response.status_code}")
return access_token
oauth_token = oauth2_flow_simulation()
# 7. 會話級認證
print("\n7. 會話級認證:")
# 創建帶認證的Session
auth_session = requests.Session()
auth_session.auth = ('session_user', 'session_pass')
# 所有通過這個Session的請求都會自動包含認證信息
response1 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
print(f"會話認證請求1: {response1.status_code}")
response2 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
print(f"會話認證請求2: {response2.status_code}")
# 8. 認證錯誤處理
print("\n8. 認證錯誤處理:")
def handle_auth_errors():
"""處理認證相關錯誤"""
# 測試錯誤的認證信息
try:
response = requests.get(
'https://httpbin.org/basic-auth/user/pass',
auth=('wrong_user', 'wrong_pass'),
timeout=5
)
if response.status_code == 401:
print("✗ 認證失敗: 用戶名或密碼錯誤")
print(f" WWW-Authenticate: {response.headers.get('WWW-Authenticate')}")
elif response.status_code == 403:
print("✗ 訪問被拒絕: 權限不足")
else:
print(f"認證狀態: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"認證請求異常: {e}")
handle_auth_errors()
# 運行認證演示
if __name__ == "__main__":
authentication_demo()
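在實際項目中,Token通常綁定到整個會話,避免每次請求都手動拼接請求頭。下面是一個簡單的示意(BearerAuth爲演示用的自定義類,token爲佔位值),結合requests.auth.AuthBase與Session.auth,讓該會話的所有請求自動攜帶Bearer Token:
import requests

class BearerAuth(requests.auth.AuthBase):
    """把Bearer Token附加到每個請求上(演示用自定義類)"""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r

if __name__ == "__main__":
    session = requests.Session()
    session.auth = BearerAuth('mock-access-token-12345')  # 佔位token
    response = session.get('https://httpbin.org/bearer')
    print(f"狀態碼: {response.status_code}")
    if response.status_code == 200:
        print(f"服務器確認的token信息: {response.json()}")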
代理設置和SSL配置¶
在爬蟲開發中,代理和SSL配置是非常重要的功能,可以幫助我們繞過網絡限制和確保安全通信。
import requests
import ssl
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context
def proxy_and_ssl_demo():
"""
演示代理設置和SSL配置
"""
print("=== 代理設置和SSL配置演示 ===")
# 1. HTTP代理設置
print("\n1. HTTP代理設置:")
# 基本代理設置
proxies = {
'http': 'http://proxy.example.com:8080',
'https': 'https://proxy.example.com:8080'
}
# 注意:這裏使用示例代理,實際運行時需要替換爲真實代理
print(f"配置的代理: {proxies}")
# 帶認證的代理
auth_proxies = {
'http': 'http://username:password@proxy.example.com:8080',
'https': 'https://username:password@proxy.example.com:8080'
}
print(f"帶認證的代理: {auth_proxies}")
# 2. SOCKS代理設置
print("\n2. SOCKS代理設置:")
# 需要安裝: pip install requests[socks]
socks_proxies = {
'http': 'socks5://127.0.0.1:1080',
'https': 'socks5://127.0.0.1:1080'
}
print(f"SOCKS代理配置: {socks_proxies}")
# 3. 代理輪換
print("\n3. 代理輪換演示:")
import random
proxy_list = [
{'http': 'http://proxy1.example.com:8080', 'https': 'https://proxy1.example.com:8080'},
{'http': 'http://proxy2.example.com:8080', 'https': 'https://proxy2.example.com:8080'},
{'http': 'http://proxy3.example.com:8080', 'https': 'https://proxy3.example.com:8080'}
]
def get_random_proxy():
"""獲取隨機代理"""
return random.choice(proxy_list)
# 模擬使用不同代理發送請求
for i in range(3):
proxy = get_random_proxy()
print(f"請求 {i+1} 使用代理: {proxy['http']}")
# 實際請求代碼:
# response = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)
# 4. 代理驗證和測試
print("\n4. 代理驗證:")
def test_proxy(proxy_dict, test_url='https://httpbin.org/ip'):
"""測試代理是否可用"""
try:
response = requests.get(
test_url,
proxies=proxy_dict,
timeout=10
)
if response.status_code == 200:
ip_info = response.json()
print(f"✓ 代理可用")
print(f" 出口IP: {ip_info.get('origin')}")
print(f" 響應時間: {response.elapsed.total_seconds():.3f}秒")
return True
else:
print(f"✗ 代理響應異常: {response.status_code}")
return False
except requests.exceptions.ProxyError:
print("✗ 代理連接失敗")
return False
except requests.exceptions.Timeout:
print("✗ 代理連接超時")
return False
except requests.exceptions.RequestException as e:
print(f"✗ 代理請求異常: {e}")
return False
# 測試直連(無代理)
print("\n測試直連:")
try:
direct_response = requests.get('https://httpbin.org/ip', timeout=10)
if direct_response.status_code == 200:
ip_info = direct_response.json()
print(f"✓ 直連成功")
print(f" 本地IP: {ip_info.get('origin')}")
except Exception as e:
print(f"✗ 直連失敗: {e}")
# 5. SSL配置
print("\n5. SSL配置演示:")
# 禁用SSL驗證(不推薦用於生產環境)
print("\n禁用SSL驗證:")
try:
response = requests.get(
'https://httpbin.org/get',
verify=False # 禁用SSL證書驗證
)
print(f"✓ 禁用SSL驗證請求成功: {response.status_code}")
except Exception as e:
print(f"✗ SSL請求失敗: {e}")
# 自定義CA證書
print("\n自定義CA證書:")
# 指定CA證書文件路徑
# response = requests.get('https://httpbin.org/get', verify='/path/to/ca-bundle.crt')
print("可以通過verify參數指定CA證書文件路徑")
# 客戶端證書認證
print("\n客戶端證書認證:")
# cert參數可以是證書文件路徑的字符串,或者是(cert, key)元組
# response = requests.get('https://httpbin.org/get', cert=('/path/to/client.cert', '/path/to/client.key'))
print("可以通過cert參數指定客戶端證書")
# 6. 自定義SSL上下文
print("\n6. 自定義SSL上下文:")
class SSLAdapter(HTTPAdapter):
"""自定義SSL適配器"""
def __init__(self, ssl_context=None, **kwargs):
self.ssl_context = ssl_context
super().__init__(**kwargs)
def init_poolmanager(self, *args, **kwargs):
kwargs['ssl_context'] = self.ssl_context
return super().init_poolmanager(*args, **kwargs)
# 創建自定義SSL上下文
ssl_context = create_urllib3_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE
# 使用自定義SSL適配器
session = requests.Session()
session.mount('https://', SSLAdapter(ssl_context))
try:
response = session.get('https://httpbin.org/get')
print(f"✓ 自定義SSL上下文請求成功: {response.status_code}")
except Exception as e:
print(f"✗ 自定義SSL請求失敗: {e}")
# 7. 綜合配置示例
print("\n7. 綜合配置示例:")
def create_secure_session(proxy=None, verify_ssl=True, client_cert=None):
"""創建安全配置的Session"""
session = requests.Session()
# 設置代理
if proxy:
session.proxies.update(proxy)
# SSL配置
session.verify = verify_ssl
if client_cert:
session.cert = client_cert
# 設置超時(Session沒有默認超時機制,這裏僅作記錄,實際請求時仍需傳入timeout)
session.timeout = 30
# 設置重試
from urllib3.util.retry import Retry
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)
return session
# 創建配置好的Session
secure_session = create_secure_session(
# proxy={'http': 'http://proxy.example.com:8080'},
verify_ssl=True
)
try:
response = secure_session.get('https://httpbin.org/get')
print(f"✓ 安全Session請求成功: {response.status_code}")
print(f" SSL驗證: {'啓用' if secure_session.verify else '禁用'}")
print(f" 代理設置: {secure_session.proxies if secure_session.proxies else '無'}")
except Exception as e:
print(f"✗ 安全Session請求失敗: {e}")
# 8. 環境變量代理配置
print("\n8. 環境變量代理配置:")
import os
# Requests會自動讀取這些環境變量
env_vars = {
'HTTP_PROXY': 'http://proxy.example.com:8080',
'HTTPS_PROXY': 'https://proxy.example.com:8080',
'NO_PROXY': 'localhost,127.0.0.1,.local'
}
print("可以設置的環境變量:")
for var, value in env_vars.items():
print(f" {var}={value}")
# 檢查當前環境變量
current_proxy = os.environ.get('HTTP_PROXY') or os.environ.get('http_proxy')
if current_proxy:
print(f"當前HTTP代理: {current_proxy}")
else:
print("未設置HTTP代理環境變量")
# 運行代理和SSL演示
if __name__ == "__main__":
proxy_and_ssl_demo()
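在真正開始批量爬取之前,通常會先把候選代理逐個驗證一遍,只保留可用的代理組成代理池。下面把前面test_proxy的思路整理成一個簡單示意(代理地址均爲佔位示例,使用時需替換爲真實代理):
import requests

def build_proxy_pool(candidates, test_url='https://httpbin.org/ip', timeout=10):
    """逐個測試候選代理,返回可用代理列表(示意實現)"""
    usable = []
    for proxy in candidates:
        proxies = {'http': proxy, 'https': proxy}
        try:
            response = requests.get(test_url, proxies=proxies, timeout=timeout)
            if response.status_code == 200:
                usable.append(proxy)
                print(f"✓ 可用: {proxy} -> 出口IP {response.json().get('origin')}")
        except requests.exceptions.RequestException as e:
            print(f"✗ 不可用: {proxy} ({type(e).__name__})")
    return usable

if __name__ == "__main__":
    # 佔位示例地址,實際使用時替換爲真實代理
    pool = build_proxy_pool([
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
    ])
    print(f"可用代理數量: {len(pool)}")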
Cookie處理¶
Cookie是Web應用中維護狀態的重要機制,Requests提供了強大的Cookie處理功能。
import requests
from http.cookies import SimpleCookie
import time
from datetime import datetime, timedelta
def cookie_handling_demo():
"""
演示Cookie處理的各種功能
"""
print("=== Cookie處理演示 ===")
# 1. 基本Cookie操作
print("\n1. 基本Cookie操作:")
# 發送帶Cookie的請求
cookies = {'session_id': 'abc123', 'user_pref': 'dark_mode'}
response = requests.get('https://httpbin.org/cookies', cookies=cookies)
if response.status_code == 200:
received_cookies = response.json().get('cookies', {})
print(f"發送的Cookies: {cookies}")
print(f"服務器接收的Cookies: {received_cookies}")
# 2. 從響應中獲取Cookie
print("\n2. 從響應中獲取Cookie:")
# 請求設置Cookie的URL
response = requests.get('https://httpbin.org/cookies/set/test_cookie/test_value')
print(f"響應狀態碼: {response.status_code}")
print(f"響應中的Cookies: {dict(response.cookies)}")
# 查看Cookie詳細信息
for cookie in response.cookies:
print(f"Cookie詳情:")
print(f" 名稱: {cookie.name}")
print(f" 值: {cookie.value}")
print(f" 域: {cookie.domain}")
print(f" 路徑: {cookie.path}")
print(f" 過期時間: {cookie.expires}")
print(f" 安全標誌: {cookie.secure}")
print(f" HttpOnly: {cookie.has_nonstandard_attr('HttpOnly')}")
# 3. Cookie持久化
print("\n3. Cookie持久化演示:")
# 創建Session來自動管理Cookie
session = requests.Session()
# 第一次請求,服務器設置Cookie
response1 = session.get('https://httpbin.org/cookies/set/persistent_cookie/persistent_value')
print(f"第一次請求狀態: {response1.status_code}")
print(f"Session中的Cookies: {dict(session.cookies)}")
# 第二次請求,自動攜帶Cookie
response2 = session.get('https://httpbin.org/cookies')
if response2.status_code == 200:
cookies_data = response2.json()
print(f"第二次請求攜帶的Cookies: {cookies_data.get('cookies', {})}")
# 4. 手動Cookie管理
print("\n4. 手動Cookie管理:")
from requests.cookies import RequestsCookieJar
# 創建Cookie容器
jar = RequestsCookieJar()
# 添加Cookie
jar.set('custom_cookie', 'custom_value', domain='httpbin.org', path='/')
jar.set('another_cookie', 'another_value', domain='httpbin.org', path='/')
# 使用自定義Cookie容器
response = requests.get('https://httpbin.org/cookies', cookies=jar)
if response.status_code == 200:
print(f"自定義Cookie容器: {dict(jar)}")
print(f"服務器接收: {response.json().get('cookies', {})}")
# 5. Cookie的高級屬性
print("\n5. Cookie高級屬性演示:")
def create_advanced_cookie():
"""創建帶高級屬性的Cookie"""
jar = RequestsCookieJar()
# 設置帶過期時間的Cookie
expire_time = int(time.time()) + 3600 # 1小時後過期
jar.set(
'session_token',
'token_12345',
domain='httpbin.org',
path='/',
expires=expire_time,
secure=True, # 只在HTTPS下傳輸
rest={'HttpOnly': True} # 防止JavaScript訪問
)
# 設置SameSite屬性
jar.set(
'csrf_token',
'csrf_abc123',
domain='httpbin.org',
path='/',
rest={'SameSite': 'Strict'}
)
return jar
advanced_jar = create_advanced_cookie()
print(f"高級Cookie容器: {dict(advanced_jar)}")
# 6. Cookie文件操作
print("\n6. Cookie文件操作:")
import pickle
import os
# 保存Cookie到文件
def save_cookies_to_file(session, filename):
"""保存Session的Cookie到文件"""
with open(filename, 'wb') as f:
pickle.dump(session.cookies, f)
print(f"Cookies已保存到: {filename}")
# 從文件加載Cookie
def load_cookies_from_file(session, filename):
"""從文件加載Cookie到Session"""
if os.path.exists(filename):
with open(filename, 'rb') as f:
session.cookies.update(pickle.load(f))
print(f"Cookies已從文件加載: {filename}")
return True
return False
# 演示Cookie文件操作
cookie_session = requests.Session()
# 設置一些Cookie
cookie_session.get('https://httpbin.org/cookies/set/file_cookie/file_value')
# 保存到文件
cookie_file = 'session_cookies.pkl'
save_cookies_to_file(cookie_session, cookie_file)
# 創建新Session並加載Cookie
new_session = requests.Session()
if load_cookies_from_file(new_session, cookie_file):
response = new_session.get('https://httpbin.org/cookies')
if response.status_code == 200:
print(f"加載的Cookies驗證: {response.json().get('cookies', {})}")
# 清理文件
if os.path.exists(cookie_file):
os.remove(cookie_file)
print(f"已清理Cookie文件: {cookie_file}")
# 7. Cookie域和路徑管理
print("\n7. Cookie域和路徑管理:")
def demonstrate_cookie_scope():
"""演示Cookie的作用域"""
jar = RequestsCookieJar()
# 設置不同域和路徑的Cookie
jar.set('global_cookie', 'global_value', domain='.example.com', path='/')
jar.set('api_cookie', 'api_value', domain='api.example.com', path='/v1/')
jar.set('admin_cookie', 'admin_value', domain='admin.example.com', path='/admin/')
print("Cookie作用域演示:")
for cookie in jar:
print(f" {cookie.name}: 域={cookie.domain}, 路徑={cookie.path}")
return jar
scope_jar = demonstrate_cookie_scope()
# 8. Cookie安全性
print("\n8. Cookie安全性演示:")
def create_secure_cookies():
"""創建安全的Cookie設置"""
jar = RequestsCookieJar()
# 安全Cookie設置
security_settings = {
'session_id': {
'value': 'secure_session_123',
'secure': True, # 只在HTTPS傳輸
'httponly': True, # 防止XSS攻擊
'samesite': 'Strict', # 防止CSRF攻擊
'expires': int(time.time()) + 1800 # 30分鐘過期
},
'csrf_token': {
'value': 'csrf_token_456',
'secure': True,
'samesite': 'Strict',
'expires': int(time.time()) + 3600 # 1小時過期
}
}
for name, settings in security_settings.items():
jar.set(
name,
settings['value'],
domain='httpbin.org',
path='/',
expires=settings.get('expires'),
secure=settings.get('secure', False),
rest={
'HttpOnly': settings.get('httponly', False),
'SameSite': settings.get('samesite', 'Lax')
}
)
print("安全Cookie配置:")
for cookie in jar:
print(f" {cookie.name}: 安全={cookie.secure}")
return jar
secure_jar = create_secure_cookies()
# 9. Cookie調試和分析
print("\n9. Cookie調試和分析:")
def analyze_cookies(response):
"""分析響應中的Cookie"""
print("Cookie分析報告:")
if not response.cookies:
print(" 無Cookie")
return
for cookie in response.cookies:
print(f"\n Cookie: {cookie.name}")
print(f" 值: {cookie.value}")
print(f" 域: {cookie.domain or '未設置'}")
print(f" 路徑: {cookie.path or '/'}")
if cookie.expires:
expire_date = datetime.fromtimestamp(cookie.expires)
print(f" 過期時間: {expire_date}")
# 檢查是否即將過期
if expire_date < datetime.now() + timedelta(hours=1):
print(f" ⚠️ 警告: Cookie將在1小時內過期")
else:
print(f" 過期時間: 會話結束")
print(f" 安全標誌: {cookie.secure}")
print(f" 大小: {len(cookie.value)}字節")
# 檢查Cookie大小
if len(cookie.value) > 4000:
print(f" ⚠️ 警告: Cookie過大,可能被截斷")
# 分析一個帶Cookie的響應
test_response = requests.get('https://httpbin.org/cookies/set/analysis_cookie/test_analysis_value')
analyze_cookies(test_response)
# 10. Cookie錯誤處理
print("\n10. Cookie錯誤處理:")
def handle_cookie_errors():
"""處理Cookie相關錯誤"""
try:
# 嘗試設置無效的Cookie
jar = RequestsCookieJar()
# 測試各種邊界情況
test_cases = [
('valid_cookie', 'valid_value'),
('', 'empty_name'), # 空名稱
('space cookie', 'space_in_name'), # 名稱包含空格
('valid_name', ''), # 空值
('long_cookie', 'x' * 5000), # 超長值
]
for name, value in test_cases:
try:
jar.set(name, value, domain='httpbin.org')
print(f"✓ 成功設置Cookie: {name[:20]}...")
except Exception as e:
print(f"✗ 設置Cookie失敗 ({name[:20]}...): {e}")
# 測試Cookie發送
response = requests.get('https://httpbin.org/cookies', cookies=jar, timeout=5)
print(f"Cookie發送測試: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"Cookie請求異常: {e}")
except Exception as e:
print(f"Cookie處理異常: {e}")
handle_cookie_errors()
# 運行Cookie演示
if __name__ == "__main__":
cookie_handling_demo()
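除了上面用pickle序列化之外,標準庫http.cookiejar提供的MozillaCookieJar還能把Cookie保存成通用的cookies.txt文本格式(Netscape格式),便於與curl、wget等工具互通。下面是一個簡單示意(文件名爲演示用的佔位值):
import requests
from http.cookiejar import MozillaCookieJar

def mozilla_cookiejar_demo(filename='cookies.txt'):
    """用MozillaCookieJar把Session的Cookie存成cookies.txt格式(示意實現)"""
    session = requests.Session()
    session.cookies = MozillaCookieJar(filename)  # 替換默認的Cookie容器
    # 讓服務器設置一個Cookie
    session.get('https://httpbin.org/cookies/set/demo_cookie/demo_value')
    # 保存到文本文件(ignore_discard=True才會保存會話Cookie)
    session.cookies.save(ignore_discard=True, ignore_expires=True)
    # 新的Session從文件加載同一份Cookie
    new_session = requests.Session()
    jar = MozillaCookieJar(filename)
    jar.load(ignore_discard=True, ignore_expires=True)
    new_session.cookies = jar
    response = new_session.get('https://httpbin.org/cookies')
    print(f"加載後發送的Cookies: {response.json().get('cookies', {})}")

if __name__ == "__main__":
    mozilla_cookiejar_demo()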
文件上傳和下載¶
文件傳輸是網絡爬蟲和自動化中的重要功能,Requests提供了簡單而強大的文件處理能力。
import requests
import os
import io
from pathlib import Path
import mimetypes
import hashlib
from tqdm import tqdm
def file_transfer_demo():
"""
演示文件上傳和下載功能
"""
print("=== 文件上傳和下載演示 ===")
# 1. 基本文件上傳
print("\n1. 基本文件上傳:")
# 創建測試文件
test_file_content = "這是一個測試文件\nTest file content\n測試數據123"
test_file_path = "test_upload.txt"
with open(test_file_path, 'w', encoding='utf-8') as f:
f.write(test_file_content)
# 方法1: 使用files參數上傳
with open(test_file_path, 'rb') as f:
files = {'file': f}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"文件上傳成功")
print(f"上傳的文件信息: {result.get('files', {})}")
# 2. 高級文件上傳
print("\n2. 高級文件上傳:")
# 指定文件名和MIME類型
with open(test_file_path, 'rb') as f:
files = {
'document': ('custom_name.txt', f, 'text/plain'),
'metadata': ('info.json', io.StringIO('{"type": "document"}'), 'application/json')
}
# 同時發送表單數據
data = {
'title': '測試文檔',
'description': '這是一個測試上傳',
'category': 'test'
}
response = requests.post('https://httpbin.org/post', files=files, data=data)
if response.status_code == 200:
result = response.json()
print(f"高級上傳成功")
print(f"表單數據: {result.get('form', {})}")
print(f"文件數據: {list(result.get('files', {}).keys())}")
# 3. 多文件上傳
print("\n3. 多文件上傳:")
# 創建多個測試文件
test_files = []
for i in range(3):
filename = f"test_file_{i+1}.txt"
content = f"這是測試文件 {i+1}\nFile {i+1} content\n"
with open(filename, 'w', encoding='utf-8') as f:
f.write(content)
test_files.append(filename)
# 上傳多個文件
files = []
for filename in test_files:
files.append(('files', (filename, open(filename, 'rb'), 'text/plain')))
try:
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
result = response.json()
print(f"多文件上傳成功")
print(f"上傳文件數量: {len(result.get('files', {}))}")
finally:
# 關閉文件句柄
for _, (_, file_obj, _) in files:
file_obj.close()
# 4. 內存文件上傳
print("\n4. 內存文件上傳:")
# 創建內存中的文件
memory_file = io.BytesIO()
memory_file.write("內存中的文件內容\nMemory file content".encode('utf-8'))
memory_file.seek(0) # 重置指針到開始
files = {'memory_file': ('memory.txt', memory_file, 'text/plain')}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
print(f"內存文件上傳成功")
memory_file.close()
# 5. 文件下載基礎
print("\n5. 文件下載基礎:")
# 下載小文件
download_url = 'https://httpbin.org/json'
response = requests.get(download_url)
if response.status_code == 200:
# 保存到文件
download_filename = 'downloaded_data.json'
with open(download_filename, 'wb') as f:
f.write(response.content)
print(f"文件下載成功: {download_filename}")
print(f"文件大小: {len(response.content)}字節")
print(f"Content-Type: {response.headers.get('content-type')}")
# 6. 大文件下載(流式下載)
print("\n6. 大文件流式下載:")
def download_large_file(url, filename, chunk_size=8192):
"""流式下載大文件"""
try:
with requests.get(url, stream=True) as response:
response.raise_for_status()
# 獲取文件大小
total_size = int(response.headers.get('content-length', 0))
with open(filename, 'wb') as f:
if total_size > 0:
# 使用進度條
with tqdm(total=total_size, unit='B', unit_scale=True, desc=filename) as pbar:
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
pbar.update(len(chunk))
else:
# 無法獲取文件大小時
downloaded = 0
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
downloaded += len(chunk)
print(f"\r已下載: {downloaded}字節", end='', flush=True)
print() # 換行
print(f"\n✓ 文件下載完成: {filename}")
return True
except requests.exceptions.RequestException as e:
print(f"✗ 下載失敗: {e}")
return False
# 演示流式下載(使用較小的文件作爲示例)
large_file_url = 'https://httpbin.org/bytes/10240' # 10KB測試文件
if download_large_file(large_file_url, 'large_download.bin'):
file_size = os.path.getsize('large_download.bin')
print(f"下載文件大小: {file_size}字節")
# 7. 斷點續傳下載
print("\n7. 斷點續傳下載:")
def resume_download(url, filename, chunk_size=8192):
"""支持斷點續傳的下載"""
# 檢查本地文件是否存在
resume_pos = 0
if os.path.exists(filename):
resume_pos = os.path.getsize(filename)
print(f"發現本地文件,從位置 {resume_pos} 繼續下載")
# 設置Range頭進行斷點續傳
headers = {'Range': f'bytes={resume_pos}-'} if resume_pos > 0 else {}
try:
response = requests.get(url, headers=headers, stream=True)
# 檢查服務器是否支持斷點續傳
if resume_pos > 0 and response.status_code != 206:
print("服務器不支持斷點續傳,重新下載")
resume_pos = 0
response = requests.get(url, stream=True)
response.raise_for_status()
# 獲取總文件大小
if 'content-range' in response.headers:
total_size = int(response.headers['content-range'].split('/')[-1])
else:
total_size = int(response.headers.get('content-length', 0)) + resume_pos
# 打開文件(追加模式如果是續傳)
mode = 'ab' if resume_pos > 0 else 'wb'
with open(filename, mode) as f:
downloaded = resume_pos
for chunk in response.iter_content(chunk_size=chunk_size):
if chunk:
f.write(chunk)
downloaded += len(chunk)
if total_size > 0:
progress = (downloaded / total_size) * 100
print(f"\r下載進度: {progress:.1f}% ({downloaded}/{total_size})", end='', flush=True)
print(f"\n✓ 下載完成: {filename}")
return True
except requests.exceptions.RequestException as e:
print(f"✗ 下載失敗: {e}")
return False
# 演示斷點續傳(模擬)
resume_url = 'https://httpbin.org/bytes/5120' # 5KB測試文件
resume_filename = 'resume_download.bin'
# 先下載一部分(模擬中斷)
try:
response = requests.get(resume_url, stream=True)
with open(resume_filename, 'wb') as f:
for i, chunk in enumerate(response.iter_content(chunk_size=1024)):
if i >= 2: # 只下載前2KB
break
f.write(chunk)
print(f"模擬下載中斷,已下載: {os.path.getsize(resume_filename)}字節")
except requests.exceptions.RequestException:
pass
# 繼續下載
resume_download(resume_url, resume_filename)
# 8. 文件完整性驗證
print("\n8. 文件完整性驗證:")
def verify_file_integrity(filename, expected_hash=None, hash_algorithm='md5'):
"""驗證文件完整性"""
if not os.path.exists(filename):
print(f"✗ 文件不存在: {filename}")
return False
# 計算文件哈希
hash_obj = hashlib.new(hash_algorithm)
with open(filename, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
hash_obj.update(chunk)
file_hash = hash_obj.hexdigest()
print(f"文件 {filename} 的{hash_algorithm.upper()}哈希: {file_hash}")
if expected_hash:
if file_hash == expected_hash:
print(f"✓ 文件完整性驗證通過")
return True
else:
print(f"✗ 文件完整性驗證失敗")
print(f" 期望: {expected_hash}")
print(f" 實際: {file_hash}")
return False
return True
# 驗證下載的文件
for filename in ['downloaded_data.json', 'large_download.bin']:
if os.path.exists(filename):
verify_file_integrity(filename)
# 9. 自動MIME類型檢測
print("\n9. 自動MIME類型檢測:")
def upload_with_auto_mime(filename):
"""自動檢測MIME類型並上傳"""
if not os.path.exists(filename):
print(f"文件不存在: {filename}")
return
# 自動檢測MIME類型
mime_type, _ = mimetypes.guess_type(filename)
if mime_type is None:
mime_type = 'application/octet-stream' # 默認二進制類型
print(f"文件: {filename}")
print(f"檢測到的MIME類型: {mime_type}")
with open(filename, 'rb') as f:
files = {'file': (filename, f, mime_type)}
response = requests.post('https://httpbin.org/post', files=files)
if response.status_code == 200:
print(f"✓ 上傳成功")
else:
print(f"✗ 上傳失敗: {response.status_code}")
# 測試不同類型的文件
test_files_mime = ['test_upload.txt', 'downloaded_data.json']
for filename in test_files_mime:
if os.path.exists(filename):
upload_with_auto_mime(filename)
# 10. 清理測試文件
print("\n10. 清理測試文件:")
cleanup_files = [
test_file_path, 'downloaded_data.json', 'large_download.bin',
'resume_download.bin'
] + test_files
for filename in cleanup_files:
if os.path.exists(filename):
try:
os.remove(filename)
print(f"✓ 已刪除: {filename}")
except Exception as e:
print(f"✗ 刪除失敗 {filename}: {e}")
# 運行文件傳輸演示
if __name__ == "__main__":
file_transfer_demo()
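在流式下載大文件時,還可以邊寫入邊計算哈希,下載結束即可直接得到校驗值,省去事後再完整讀一遍文件的開銷。下面是一個簡單示意(URL使用httpbin的隨機字節接口作爲測試):
import hashlib
import requests

def download_with_hash(url, filename, chunk_size=8192, algorithm='sha256'):
    """流式下載的同時計算文件哈希(示意實現)"""
    hash_obj = hashlib.new(algorithm)
    with requests.get(url, stream=True, timeout=(5, 30)) as response:
        response.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)          # 寫入文件
                    hash_obj.update(chunk)  # 同步更新哈希
    return hash_obj.hexdigest()

if __name__ == "__main__":
    digest = download_with_hash('https://httpbin.org/bytes/10240', 'hash_demo.bin')
    print(f"下載完成,SHA-256: {digest}")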
超時和重試機制¶
在網絡請求中,超時和重試機制是確保程序穩定性的重要功能。
import requests
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from functools import wraps
import logging
def timeout_and_retry_demo():
"""
演示超時和重試機制
"""
print("=== 超時和重試機制演示 ===")
# 1. 基本超時設置
print("\n1. 基本超時設置:")
# 連接超時和讀取超時
try:
# timeout=(連接超時, 讀取超時)
response = requests.get('https://httpbin.org/delay/2', timeout=(5, 10))
print(f"請求成功: {response.status_code}")
print(f"響應時間: {response.elapsed.total_seconds():.2f}秒")
except requests.exceptions.Timeout as e:
print(f"請求超時: {e}")
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
# 2. 不同類型的超時
print("\n2. 不同類型的超時演示:")
def test_different_timeouts():
"""測試不同的超時設置"""
timeout_configs = [
("單一超時", 5), # 連接和讀取都是5秒
("分別設置", (3, 10)), # 連接3秒,讀取10秒
("只設置連接超時", (2, None)), # 只設置連接超時
]
for desc, timeout in timeout_configs:
try:
print(f"\n測試 {desc}: {timeout}")
start_time = time.time()
response = requests.get('https://httpbin.org/delay/1', timeout=timeout)
elapsed = time.time() - start_time
print(f" ✓ 成功: {response.status_code}, 耗時: {elapsed:.2f}秒")
except requests.exceptions.Timeout as e:
elapsed = time.time() - start_time
print(f" ✗ 超時: {elapsed:.2f}秒, {e}")
except Exception as e:
print(f" ✗ 異常: {e}")
test_different_timeouts()
# 3. 手動重試機制
print("\n3. 手動重試機制:")
def manual_retry(url, max_retries=3, delay=1, backoff=2):
"""手動實現重試機制"""
for attempt in range(max_retries + 1):
try:
print(f" 嘗試 {attempt + 1}/{max_retries + 1}: {url}")
response = requests.get(url, timeout=5)
# 檢查響應狀態
if response.status_code == 200:
print(f" ✓ 成功: {response.status_code}")
return response
elif response.status_code >= 500:
# 服務器錯誤,可以重試
print(f" 服務器錯誤 {response.status_code},準備重試")
raise requests.exceptions.RequestException(f"Server error: {response.status_code}")
else:
# 客戶端錯誤,不重試
print(f" 客戶端錯誤 {response.status_code},不重試")
return response
except (requests.exceptions.Timeout,
requests.exceptions.ConnectionError,
requests.exceptions.RequestException) as e:
print(f" ✗ 請求失敗: {e}")
if attempt < max_retries:
wait_time = delay * (backoff ** attempt)
print(f" 等待 {wait_time:.1f}秒 後重試...")
time.sleep(wait_time)
else:
print(f" 已達到最大重試次數,放棄")
raise
return None
# 測試手動重試
try:
response = manual_retry('https://httpbin.org/status/500', max_retries=2)
except Exception as e:
print(f"手動重試最終失敗: {e}")
# 4. 使用urllib3的重試策略
print("\n4. urllib3重試策略:")
def create_retry_session():
"""創建帶重試策略的Session"""
session = requests.Session()
# 定義重試策略
retry_strategy = Retry(
total=3, # 總重試次數
status_forcelist=[429, 500, 502, 503, 504], # 需要重試的狀態碼
allowed_methods=["HEAD", "GET", "OPTIONS"], # 允許重試的方法(舊版urllib3中該參數名爲method_whitelist)
backoff_factor=1, # 退避因子
raise_on_redirect=False,
raise_on_status=False
)
# 創建適配器
adapter = HTTPAdapter(max_retries=retry_strategy)
# 掛載適配器
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
# 使用重試Session
retry_session = create_retry_session()
try:
print("使用重試Session請求:")
response = retry_session.get('https://httpbin.org/status/503', timeout=10)
print(f"最終響應: {response.status_code}")
except Exception as e:
print(f"重試Session失敗: {e}")
# 5. 高級重試配置
print("\n5. 高級重試配置:")
def create_advanced_retry_session():
"""創建高級重試配置的Session"""
session = requests.Session()
# 高級重試策略
retry_strategy = Retry(
total=5, # 總重試次數
read=3, # 讀取重試次數
connect=3, # 連接重試次數
status=3, # 狀態碼重試次數
status_forcelist=[408, 429, 500, 502, 503, 504, 520, 522, 524],
allowed_methods=["HEAD", "GET", "PUT", "DELETE", "OPTIONS", "TRACE"], # 舊版urllib3中爲method_whitelist
backoff_factor=0.3, # 退避因子:{backoff factor} * (2 ** ({number of total retries} - 1))
raise_on_redirect=False,
raise_on_status=False,
respect_retry_after_header=True # 尊重服務器的Retry-After頭
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
advanced_session = create_advanced_retry_session()
# 測試高級重試
test_urls = [
('正常請求', 'https://httpbin.org/get'),
('服務器錯誤', 'https://httpbin.org/status/500'),
('超時請求', 'https://httpbin.org/delay/3')
]
for desc, url in test_urls:
try:
print(f"\n測試 {desc}:")
start_time = time.time()
response = advanced_session.get(url, timeout=(5, 10))
elapsed = time.time() - start_time
print(f" ✓ 響應: {response.status_code}, 耗時: {elapsed:.2f}秒")
except Exception as e:
elapsed = time.time() - start_time
print(f" ✗ 失敗: {e}, 耗時: {elapsed:.2f}秒")
# 6. 裝飾器重試
print("\n6. 裝飾器重試:")
def retry_decorator(max_retries=3, delay=1, backoff=2, exceptions=(Exception,)):
"""重試裝飾器"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except exceptions as e:
if attempt == max_retries:
print(f"裝飾器重試失敗,已達最大次數: {e}")
raise
wait_time = delay * (backoff ** attempt)
print(f"裝飾器重試 {attempt + 1}/{max_retries + 1} 失敗: {e}")
print(f"等待 {wait_time:.1f}秒 後重試...")
time.sleep(wait_time)
return wrapper
return decorator
@retry_decorator(max_retries=2, delay=0.5, exceptions=(requests.exceptions.RequestException,))
def unreliable_request(url):
"""不穩定的請求函數"""
# 模擬隨機失敗
if random.random() < 0.7: # 70%概率失敗
raise requests.exceptions.ConnectionError("模擬連接失敗")
response = requests.get(url, timeout=5)
return response
# 測試裝飾器重試
try:
print("測試裝飾器重試:")
response = unreliable_request('https://httpbin.org/get')
print(f"裝飾器重試成功: {response.status_code}")
except Exception as e:
print(f"裝飾器重試最終失敗: {e}")
# 7. 智能重試策略
print("\n7. 智能重試策略:")
class SmartRetry:
"""智能重試類"""
def __init__(self, max_retries=3, base_delay=1, max_delay=60):
self.max_retries = max_retries
self.base_delay = base_delay
self.max_delay = max_delay
self.attempt_count = 0
def should_retry(self, exception, response=None):
"""判斷是否應該重試"""
# 網絡相關異常應該重試
if isinstance(exception, (requests.exceptions.Timeout,
requests.exceptions.ConnectionError)):
return True
# 特定狀態碼應該重試
if response and response.status_code in [429, 500, 502, 503, 504]:
return True
return False
def get_delay(self):
"""計算延遲時間"""
# 指數退避 + 隨機抖動
delay = min(self.base_delay * (2 ** self.attempt_count), self.max_delay)
jitter = random.uniform(0, 0.1) * delay # 10%的隨機抖動
return delay + jitter
def execute(self, func, *args, **kwargs):
"""執行帶重試的函數"""
last_exception = None
for attempt in range(self.max_retries + 1):
self.attempt_count = attempt
try:
result = func(*args, **kwargs)
# 如果是Response對象,檢查狀態碼
if hasattr(result, 'status_code'):
if self.should_retry(None, result) and attempt < self.max_retries:
print(f"智能重試: 狀態碼 {result.status_code},嘗試 {attempt + 1}")
time.sleep(self.get_delay())
continue
print(f"智能重試成功,嘗試次數: {attempt + 1}")
return result
except Exception as e:
last_exception = e
if self.should_retry(e) and attempt < self.max_retries:
delay = self.get_delay()
print(f"智能重試: {e},等待 {delay:.2f}秒,嘗試 {attempt + 1}")
time.sleep(delay)
else:
break
print(f"智能重試失敗,已達最大次數")
raise last_exception
# 測試智能重試
smart_retry = SmartRetry(max_retries=3, base_delay=0.5)
def test_request():
# 模擬不穩定的請求
if random.random() < 0.6:
raise requests.exceptions.ConnectionError("模擬網絡錯誤")
return requests.get('https://httpbin.org/get', timeout=5)
try:
response = smart_retry.execute(test_request)
print(f"智能重試最終成功: {response.status_code}")
except Exception as e:
print(f"智能重試最終失敗: {e}")
# 8. 重試監控和日誌
print("\n8. 重試監控和日誌:")
# 配置日誌
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
class MonitoredRetry:
"""帶監控的重試類"""
def __init__(self, max_retries=3):
self.max_retries = max_retries
self.stats = {
'total_attempts': 0,
'successful_attempts': 0,
'failed_attempts': 0,
'retry_reasons': {}
}
def request_with_monitoring(self, url, **kwargs):
"""帶監控的請求"""
for attempt in range(self.max_retries + 1):
self.stats['total_attempts'] += 1
try:
logger.info(f"嘗試請求 {url},第 {attempt + 1} 次")
response = requests.get(url, **kwargs)
if response.status_code == 200:
self.stats['successful_attempts'] += 1
logger.info(f"請求成功: {response.status_code}")
return response
else:
reason = f"status_{response.status_code}"
self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1
if attempt < self.max_retries:
logger.warning(f"請求失敗: {response.status_code},準備重試")
time.sleep(1)
else:
logger.error(f"請求最終失敗: {response.status_code}")
return response
except Exception as e:
reason = type(e).__name__
self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1
if attempt < self.max_retries:
logger.warning(f"請求異常: {e},準備重試")
time.sleep(1)
else:
self.stats['failed_attempts'] += 1
logger.error(f"請求最終異常: {e}")
raise
def get_stats(self):
"""獲取統計信息"""
return self.stats
# 測試監控重試
monitored_retry = MonitoredRetry(max_retries=2)
test_urls_monitor = [
'https://httpbin.org/get',
'https://httpbin.org/status/500',
'https://httpbin.org/delay/1'
]
for url in test_urls_monitor:
try:
response = monitored_retry.request_with_monitoring(url, timeout=3)
print(f"監控請求結果: {response.status_code if response else 'None'}")
except Exception as e:
print(f"監控請求異常: {e}")
# 顯示統計信息
stats = monitored_retry.get_stats()
print(f"\n重試統計信息:")
print(f" 總嘗試次數: {stats['total_attempts']}")
print(f" 成功次數: {stats['successful_attempts']}")
print(f" 失敗次數: {stats['failed_attempts']}")
print(f" 重試原因: {stats['retry_reasons']}")
# 9. 超時和重試的最佳實踐
print("\n9. 超時和重試最佳實踐:")
def best_practice_request(url, max_retries=3, timeout=(5, 30)):
"""最佳實踐的請求函數"""
session = requests.Session()
# 配置重試策略
retry_strategy = Retry(
total=max_retries,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["HEAD", "GET", "OPTIONS"], # 舊版urllib3中爲method_whitelist
backoff_factor=1,
respect_retry_after_header=True
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
# requests.Session沒有默認超時機制,超時需要在每次請求時顯式傳入
try:
response = session.get(url, timeout=timeout)
response.raise_for_status() # 拋出HTTP錯誤
return response
except requests.exceptions.Timeout:
print(f"請求超時: {url}")
raise
except requests.exceptions.ConnectionError:
print(f"連接錯誤: {url}")
raise
except requests.exceptions.HTTPError as e:
print(f"HTTP錯誤: {e}")
raise
except requests.exceptions.RequestException as e:
print(f"請求異常: {e}")
raise
finally:
session.close()
# 測試最佳實踐
try:
response = best_practice_request('https://httpbin.org/get')
print(f"最佳實踐請求成功: {response.status_code}")
except Exception as e:
print(f"最佳實踐請求失敗: {e}")
# 運行超時和重試演示
if __name__ == "__main__":
timeout_and_retry_demo()
異常處理¶
完善的異常處理是構建穩定爬蟲程序的關鍵。
import requests
import json
import time
from requests.exceptions import (
RequestException, Timeout, ConnectionError, HTTPError,
URLRequired, TooManyRedirects, MissingSchema, InvalidSchema,
InvalidURL, InvalidHeader, ChunkedEncodingError, ContentDecodingError,
StreamConsumedError, RetryError, UnrewindableBodyError
)
import logging
from datetime import datetime
def exception_handling_demo():
"""
演示Requests異常處理
"""
print("=== Requests異常處理演示 ===")
# 配置日誌
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# 1. 基本異常類型
print("\n1. 基本異常類型演示:")
def demonstrate_basic_exceptions():
"""演示基本異常類型"""
# 異常測試用例
test_cases = [
{
'name': '正常請求',
'url': 'https://httpbin.org/get',
'expected': 'success'
},
{
'name': '連接超時',
'url': 'https://httpbin.org/delay/10',
'timeout': 2,
'expected': 'timeout'
},
{
'name': '無效URL',
'url': 'invalid-url',
'expected': 'invalid_url'
},
{
'name': '不存在的域名',
'url': 'https://this-domain-does-not-exist-12345.com',
'expected': 'connection_error'
},
{
'name': 'HTTP錯誤狀態',
'url': 'https://httpbin.org/status/404',
'expected': 'http_error'
},
{
'name': '服務器錯誤',
'url': 'https://httpbin.org/status/500',
'expected': 'server_error'
}
]
for case in test_cases:
print(f"\n測試: {case['name']}")
try:
kwargs = {}
if 'timeout' in case:
kwargs['timeout'] = case['timeout']
response = requests.get(case['url'], **kwargs)
# 檢查HTTP狀態碼
if response.status_code >= 400:
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
except Timeout as e:
print(f" ✗ 超時異常: {e}")
logger.warning(f"請求超時: {case['url']}")
except ConnectionError as e:
print(f" ✗ 連接異常: {e}")
logger.error(f"連接失敗: {case['url']}")
except HTTPError as e:
print(f" ✗ HTTP異常: {e}")
print(f" 狀態碼: {e.response.status_code}")
print(f" 原因: {e.response.reason}")
logger.error(f"HTTP錯誤: {case['url']} - {e.response.status_code}")
except InvalidURL as e:
print(f" ✗ 無效URL: {e}")
logger.error(f"URL格式錯誤: {case['url']}")
except MissingSchema as e:
print(f" ✗ 缺少協議: {e}")
logger.error(f"URL缺少協議: {case['url']}")
except RequestException as e:
print(f" ✗ 請求異常: {e}")
logger.error(f"通用請求異常: {case['url']} - {e}")
except Exception as e:
print(f" ✗ 未知異常: {e}")
logger.critical(f"未知異常: {case['url']} - {e}")
demonstrate_basic_exceptions()
# 2. 異常層次結構
print("\n2. 異常層次結構:")
def show_exception_hierarchy():
"""顯示異常層次結構"""
exceptions_hierarchy = {
'RequestException': {
'description': '所有Requests異常的基類',
'children': {
'HTTPError': '4xx和5xx HTTP狀態碼異常',
'ConnectionError': '連接相關異常',
'Timeout': '超時異常',
'URLRequired': '缺少URL異常',
'TooManyRedirects': '重定向次數過多異常',
'MissingSchema': '缺少URL協議異常',
'InvalidSchema': '無效URL協議異常',
'InvalidURL': '無效URL異常',
'InvalidHeader': '無效請求頭異常',
'ChunkedEncodingError': '分塊編碼錯誤',
'ContentDecodingError': '內容解碼錯誤',
'StreamConsumedError': '流已消費錯誤',
'RetryError': '重試錯誤',
'UnrewindableBodyError': '不可重繞請求體錯誤'
}
}
}
print("Requests異常層次結構:")
for parent, info in exceptions_hierarchy.items():
print(f"\n{parent}: {info['description']}")
for child, desc in info['children'].items():
print(f" ├── {child}: {desc}")
show_exception_hierarchy()
# 3. 詳細異常處理
print("\n3. 詳細異常處理:")
def detailed_exception_handling(url, **kwargs):
"""詳細的異常處理函數"""
try:
print(f"請求: {url}")
response = requests.get(url, **kwargs)
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
return response
except Timeout as e:
error_info = {
'type': 'Timeout',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '增加超時時間或檢查網絡連接'
}
print(f" ✗ 超時: {error_info}")
return None
except ConnectionError as e:
error_info = {
'type': 'ConnectionError',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查網絡連接、DNS設置或目標服務器狀態'
}
print(f" ✗ 連接錯誤: {error_info}")
return None
except HTTPError as e:
status_code = e.response.status_code
error_info = {
'type': 'HTTPError',
'status_code': status_code,
'reason': e.response.reason,
'url': url,
'timestamp': datetime.now().isoformat(),
'response_headers': dict(e.response.headers),
'suggestion': get_http_error_suggestion(status_code)
}
print(f" ✗ HTTP錯誤: {error_info}")
return e.response
except InvalidURL as e:
error_info = {
'type': 'InvalidURL',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查URL格式是否正確'
}
print(f" ✗ 無效URL: {error_info}")
return None
except RequestException as e:
error_info = {
'type': 'RequestException',
'message': str(e),
'url': url,
'timestamp': datetime.now().isoformat(),
'suggestion': '檢查請求參數和網絡環境'
}
print(f" ✗ 請求異常: {error_info}")
return None
def get_http_error_suggestion(status_code):
"""根據HTTP狀態碼提供建議"""
suggestions = {
400: '檢查請求參數格式',
401: '檢查身份驗證信息',
403: '檢查訪問權限',
404: '檢查URL路徑是否正確',
405: '檢查HTTP方法是否正確',
429: '降低請求頻率,實現重試機制',
500: '服務器內部錯誤,稍後重試',
502: '網關錯誤,檢查代理設置',
503: '服務不可用,稍後重試',
504: '網關超時,增加超時時間'
}
return suggestions.get(status_code, '查看服務器文檔或聯繫管理員')
# 測試詳細異常處理
test_urls = [
'https://httpbin.org/get',
'https://httpbin.org/status/401',
'https://httpbin.org/delay/5',
'invalid-url-format'
]
for url in test_urls:
detailed_exception_handling(url, timeout=3)
# 4. 異常重試策略
print("\n4. 異常重試策略:")
def exception_based_retry(url, max_retries=3, **kwargs):
"""基於異常類型的重試策略"""
# 定義可重試的異常
retryable_exceptions = (
Timeout,
ConnectionError,
ChunkedEncodingError,
ContentDecodingError
)
# 定義可重試的HTTP狀態碼
retryable_status_codes = [429, 500, 502, 503, 504]
last_exception = None
for attempt in range(max_retries + 1):
try:
print(f"嘗試 {attempt + 1}/{max_retries + 1}: {url}")
response = requests.get(url, **kwargs)
# 檢查狀態碼是否需要重試
if response.status_code in retryable_status_codes and attempt < max_retries:
print(f" 狀態碼 {response.status_code} 需要重試")
time.sleep(2 ** attempt) # 指數退避
continue
response.raise_for_status()
print(f" ✓ 成功: {response.status_code}")
return response
except retryable_exceptions as e:
last_exception = e
if attempt < max_retries:
wait_time = 2 ** attempt
print(f" 可重試異常 {type(e).__name__}: {e}")
print(f" 等待 {wait_time}秒 後重試...")
time.sleep(wait_time)
else:
print(f" 重試次數已用完")
break
except HTTPError as e:
if e.response.status_code in retryable_status_codes and attempt < max_retries:
wait_time = 2 ** attempt
print(f" HTTP錯誤 {e.response.status_code} 可重試")
print(f" 等待 {wait_time}秒 後重試...")
time.sleep(wait_time)
else:
print(f" HTTP錯誤 {e.response.status_code} 不可重試")
raise
except RequestException as e:
print(f" 不可重試異常: {e}")
raise
# 如果所有重試都失敗了
if last_exception:
raise last_exception
# 測試異常重試
retry_test_urls = [
'https://httpbin.org/status/503',
'https://httpbin.org/delay/2'
]
for url in retry_test_urls:
try:
response = exception_based_retry(url, max_retries=2, timeout=3)
print(f"重試成功: {response.status_code}")
except Exception as e:
print(f"重試失敗: {e}")
# 5. 異常日誌記錄
print("\n5. 異常日誌記錄:")
class RequestLogger:
"""請求日誌記錄器"""
def __init__(self, logger_name='requests_logger'):
self.logger = logging.getLogger(logger_name)
# 創建文件處理器
file_handler = logging.FileHandler('requests_errors.log')
file_handler.setLevel(logging.ERROR)
# 創建控制檯處理器
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
# 創建格式器
formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)
# 添加處理器
self.logger.addHandler(file_handler)
self.logger.addHandler(console_handler)
self.logger.setLevel(logging.INFO)
def log_request(self, method, url, **kwargs):
"""記錄請求信息"""
self.logger.info(f"發起請求: {method.upper()} {url}")
if kwargs:
self.logger.debug(f"請求參數: {kwargs}")
def log_response(self, response):
"""記錄響應信息"""
self.logger.info(
f"收到響應: {response.status_code} {response.reason} "
f"({len(response.content)}字節)"
)
def log_exception(self, exception, url, context=None):
"""記錄異常信息"""
error_data = {
'exception_type': type(exception).__name__,
'exception_message': str(exception),
'url': url,
'timestamp': datetime.now().isoformat()
}
if context:
error_data.update(context)
self.logger.error(f"請求異常: {json.dumps(error_data, ensure_ascii=False)}")
def safe_request(self, method, url, **kwargs):
"""安全的請求方法"""
self.log_request(method, url, **kwargs)
try:
response = requests.request(method, url, **kwargs)
self.log_response(response)
response.raise_for_status()
return response
except Exception as e:
context = {
'method': method,
'kwargs': {k: str(v) for k, v in kwargs.items()}
}
self.log_exception(e, url, context)
raise
# 測試日誌記錄
request_logger = RequestLogger()
test_requests = [
('GET', 'https://httpbin.org/get'),
('GET', 'https://httpbin.org/status/404'),
('POST', 'https://httpbin.org/post', {'json': {'test': 'data'}})
]
for method, url, *args in test_requests:
kwargs = args[0] if args else {}
try:
response = request_logger.safe_request(method, url, **kwargs)
print(f"日誌請求成功: {response.status_code}")
except Exception as e:
print(f"日誌請求失敗: {e}")
# 6. 自定義異常類
print("\n6. 自定義異常類:")
class CustomRequestException(RequestException):
"""自定義請求異常"""
pass
class RateLimitException(CustomRequestException):
"""頻率限制異常"""
def __init__(self, message, retry_after=None):
super().__init__(message)
self.retry_after = retry_after
class DataValidationException(CustomRequestException):
"""數據驗證異常"""
def __init__(self, message, validation_errors=None):
super().__init__(message)
self.validation_errors = validation_errors or []
def custom_request_handler(url, **kwargs):
"""使用自定義異常的請求處理器"""
try:
response = requests.get(url, **kwargs)
# 檢查特定狀態碼並拋出自定義異常
if response.status_code == 429:
retry_after = response.headers.get('Retry-After')
raise RateLimitException(
"請求頻率過高",
retry_after=retry_after
)
if response.status_code == 422:
try:
error_data = response.json()
validation_errors = error_data.get('errors', [])
raise DataValidationException(
"數據驗證失敗",
validation_errors=validation_errors
)
except ValueError:
raise DataValidationException("數據驗證失敗")
response.raise_for_status()
return response
except RateLimitException as e:
print(f"頻率限制: {e}")
if e.retry_after:
print(f"建議等待: {e.retry_after}秒")
raise
except DataValidationException as e:
print(f"數據驗證錯誤: {e}")
if e.validation_errors:
print(f"驗證錯誤詳情: {e.validation_errors}")
raise
# 測試自定義異常
try:
response = custom_request_handler('https://httpbin.org/status/429')
except RateLimitException as e:
print(f"捕獲自定義異常: {e}")
except Exception as e:
print(f"其他異常: {e}")
# 運行異常處理演示
if __name__ == "__main__":
exception_handling_demo()
通過以上詳細的代碼示例和說明,我們完成了14.2節Requests庫網絡請求的全部內容。這一節涵蓋了從基礎使用到高級功能的各個方面,包括GET/POST請求、參數處理、響應對象、Session管理、身份驗證、代理設置、SSL配置、Cookie處理、文件上傳下載、超時重試機制和異常處理等核心功能。每個功能都提供了實用的代碼示例和真實的運行結果,幫助讀者深入理解和掌握Requests庫的使用。
14.3 BeautifulSoup網頁解析¶
BeautifulSoup是Python中最流行的HTML和XML解析庫之一,它提供了簡單易用的API來解析、導航、搜索和修改解析樹。本節將詳細介紹BeautifulSoup的各種功能和使用技巧。
BeautifulSoup基礎¶
BeautifulSoup的安裝和基本概念是學習網頁解析的第一步。
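在進入完整演示之前,先通過一個最小示例感受BeautifulSoup「解析 → 查找 → 取值」的基本流程(HTML字符串爲示意內容,安裝方式見隨後代碼開頭的註釋):
from bs4 import BeautifulSoup

html = "<html><body><h1>標題</h1><a href='/a'>鏈接A</a><a href='/b'>鏈接B</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.get_text())                 # 輸出: 標題
for link in soup.find_all('a'):
    print(link.get_text(), link['href'])  # 依次輸出: 鏈接A /a、鏈接B /b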
# 首先需要安裝BeautifulSoup4
# pip install beautifulsoup4
# pip install lxml # 推薦的解析器
# pip install html5lib # 另一個解析器選項
import requests
from bs4 import BeautifulSoup, Comment, NavigableString
import re
from urllib.parse import urljoin, urlparse
import json
def beautifulsoup_basics_demo():
"""
演示BeautifulSoup基礎功能
"""
print("=== BeautifulSoup基礎功能演示 ===")
# 1. 基本使用和解析器
print("\n1. 基本使用和解析器:")
# 示例HTML內容
html_content = """
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<title>BeautifulSoup示例頁面</title>
<style>
.highlight { color: red; }
#main { background: #f0f0f0; }
</style>
</head>
<body>
<div id="main" class="container">
<h1 class="title">網頁解析示例</h1>
<p class="intro">這是一個用於演示BeautifulSoup功能的示例頁面。</p>
<div class="content">
<h2>文章列表</h2>
<ul class="article-list">
<li><a href="/article/1" data-id="1">Python基礎教程</a></li>
<li><a href="/article/2" data-id="2">網絡爬蟲入門</a></li>
<li><a href="/article/3" data-id="3">數據分析實戰</a></li>
</ul>
</div>
<div class="sidebar">
<h3>相關鏈接</h3>
<a href="https://python.org" target="_blank">Python官網</a>
<a href="https://docs.python.org" target="_blank">Python文檔</a>
</div>
<!-- 這是一個註釋 -->
<footer>
<p>© 2024 示例網站</p>
</footer>
</div>
</body>
</html>
"""
# 不同解析器的比較
parsers = [
('html.parser', '內置解析器,速度適中,容錯性一般'),
('lxml', '速度最快,功能強大,需要安裝lxml庫'),
('html5lib', '最好的容錯性,解析方式與瀏覽器相同,速度較慢')
]
print("可用的解析器:")
for parser, description in parsers:
try:
soup = BeautifulSoup(html_content, parser)
print(f" ✓ {parser}: {description}")
except Exception as e:
print(f" ✗ {parser}: 不可用 - {e}")
# 使用默認解析器創建BeautifulSoup對象
soup = BeautifulSoup(html_content, 'html.parser')
# 2. 基本屬性和方法
print("\n2. 基本屬性和方法:")
print(f"文檔類型: {type(soup)}")
print(f"解析器: {soup.parser}")
print(f"文檔標題: {soup.title}")
print(f"標題文本: {soup.title.string}")
print(f"HTML標籤: {soup.html.name}")
# 獲取所有文本內容
all_text = soup.get_text()
print(f"所有文本長度: {len(all_text)}字符")
print(f"文本預覽: {all_text[:100]}...")
# 3. 標籤對象的屬性
print("\n3. 標籤對象的屬性:")
# 獲取第一個div標籤
first_div = soup.find('div')
print(f"標籤名: {first_div.name}")
print(f"標籤屬性: {first_div.attrs}")
print(f"id屬性: {first_div.get('id')}")
print(f"class屬性: {first_div.get('class')}")
# 檢查屬性是否存在
print(f"是否有id屬性: {first_div.has_attr('id')}")
print(f"是否有title屬性: {first_div.has_attr('title')}")
# 4. 導航樹結構
print("\n4. 導航樹結構:")
# 父子關係
title_tag = soup.title
print(f"title標籤: {title_tag}")
print(f"父標籤: {title_tag.parent.name}")
print(f"子元素數量: {len(list(title_tag.children))}")
# 兄弟關係
h1_tag = soup.find('h1')
print(f"h1標籤: {h1_tag}")
# 下一個兄弟元素
next_sibling = h1_tag.find_next_sibling()
if next_sibling:
print(f"下一個兄弟元素: {next_sibling.name}")
# 上一個兄弟元素
p_tag = soup.find('p')
prev_sibling = p_tag.find_previous_sibling()
if prev_sibling:
print(f"p標籤的上一個兄弟: {prev_sibling.name}")
# 5. 內容類型
print("\n5. 內容類型:")
# 遍歷所有內容
body_tag = soup.body
content_types = {}
for content in body_tag.descendants:
content_type = type(content).__name__
content_types[content_type] = content_types.get(content_type, 0) + 1
print("內容類型統計:")
for content_type, count in content_types.items():
print(f" {content_type}: {count}")
# 查找註釋
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(f"\n找到 {len(comments)} 個註釋:")
for comment in comments:
print(f" 註釋: {comment.strip()}")
# 6. 編碼處理
print("\n6. 編碼處理:")
# 檢測原始編碼
print(f"檢測到的編碼: {soup.original_encoding}")
# 不同編碼的HTML
utf8_html = "<html><head><title>中文測試</title></head><body><p>你好世界</p></body></html>"
# 指定編碼解析
soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
print(f"UTF-8解析結果: {soup_utf8.title.string}")
# 轉換爲不同編碼
print(f"轉爲UTF-8: {soup_utf8.encode('utf-8')[:50]}...")
# 7. 格式化輸出
print("\n7. 格式化輸出:")
# 美化輸出
simple_html = "<div><p>Hello</p><p>World</p></div>"
simple_soup = BeautifulSoup(simple_html, 'html.parser')
print("原始HTML:")
print(simple_html)
print("\n美化後的HTML:")
print(simple_soup.prettify())
# 自定義縮進:注意prettify()本身不接受indent參數,
# 較新版本的bs4(4.11+)可通過bs4.formatter.HTMLFormatter(indent=...)作爲formatter參數控制縮進
from bs4.formatter import HTMLFormatter
print("\n自定義縮進(2個空格):")
print(simple_soup.prettify(formatter=HTMLFormatter(indent=2)))
# 8. 性能測試
print("\n8. 性能測試:")
import time
# 測試不同解析器的性能
test_html = html_content * 10 # 增大測試數據
available_parsers = []
for parser, _ in parsers:
try:
BeautifulSoup("<html></html>", parser)
available_parsers.append(parser)
except:
continue
print("解析器性能測試:")
for parser in available_parsers:
start_time = time.time()
try:
for _ in range(10):
BeautifulSoup(test_html, parser)
elapsed = time.time() - start_time
print(f" {parser}: {elapsed:.4f}秒 (10次解析)")
except Exception as e:
print(f" {parser}: 測試失敗 - {e}")
# 運行BeautifulSoup基礎演示
if __name__ == "__main__":
beautifulsoup_basics_demo()
終端日誌:
=== BeautifulSoup基礎功能演示 ===
1. 基本使用和解析器:
可用的解析器:
✓ html.parser: 內置解析器,速度適中,容錯性一般
✓ lxml: 速度最快,功能強大,需要安裝lxml庫
✓ html5lib: 最好的容錯性,解析方式與瀏覽器相同,速度較慢
2. 基本屬性和方法:
文檔類型: <class 'bs4.BeautifulSoup'>
解析器: <html.parser.HTMLParser object at 0x...>
文檔標題: <title>BeautifulSoup示例頁面</title>
標題文本: BeautifulSoup示例頁面
HTML標籤: html
所有文本長度: 385字符
文本預覽: BeautifulSoup示例頁面
.highlight { color: red; }
#main { background: #f0f0f0; }
網頁解析示例
這是一個用於演示BeautifulSoup功能的示例頁面。
文章列表
Python基礎教程
網絡爬蟲入門
數據分析實戰
相關鏈接
Python官網
Python文檔
© 2024 示例網站
3. 標籤對象的屬性:
標籤名: div
標籤屬性: {'id': 'main', 'class': ['container']}
id屬性: main
class屬性: ['container']
是否有id屬性: True
是否有title屬性: False
4. 導航樹結構:
title標籤: <title>BeautifulSoup示例頁面</title>
父標籤: head
子元素數量: 1
h1標籤: <h1 class="title">網頁解析示例</h1>
下一個兄弟元素: p
p標籤的上一個兄弟: h1
5. 內容類型:
內容類型統計:
Tag: 23
NavigableString: 31
Comment: 1
找到 1 個註釋:
註釋: 這是一個註釋
6. 編碼處理:
檢測到的編碼: utf-8
UTF-8解析結果: 中文測試
轉爲UTF-8: b'<html><head><title>\xe4\xb8\xad\xe6\x96\x87\xe6\xb5\x8b\xe8\xaf\x95</title></head><body><p>\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c</p></body></html>'
7. 格式化輸出:
原始HTML:
<div><p>Hello</p><p>World</p></div>
美化後的HTML:
<div>
<p>
Hello
</p>
<p>
World
</p>
</div>
自定義縮進(2個空格):
<div>
<p>
Hello
</p>
<p>
World
</p>
</div>
8. 性能測試:
解析器性能測試:
html.parser: 0.0156秒 (10次解析)
lxml: 0.0089秒 (10次解析)
html5lib: 0.0445秒 (10次解析)
HTML解析¶
BeautifulSoup提供了多種方法來查找和提取HTML元素。
def html_parsing_demo():
"""
演示HTML解析功能
"""
print("=== HTML解析功能演示 ===")
# 獲取示例網頁
try:
response = requests.get('https://httpbin.org/html')
soup = BeautifulSoup(response.text, 'html.parser')
print("✓ 成功獲取示例網頁")
except requests.exceptions.RequestException:
# 如果無法獲取網頁,使用本地HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>HTML解析示例</title>
<meta name="description" content="這是一個HTML解析示例頁面">
<meta name="keywords" content="HTML, 解析, BeautifulSoup">
</head>
<body>
<header>
<nav class="navbar">
<ul>
<li><a href="#home">首頁</a></li>
<li><a href="#about">關於</a></li>
<li><a href="#contact">聯繫</a></li>
</ul>
</nav>
</header>
<main>
<section id="hero" class="hero-section">
<h1>歡迎來到我的網站</h1>
<p class="lead">這裏有最新的技術文章和教程</p>
<button class="btn btn-primary" data-action="subscribe">訂閱更新</button>
</section>
<section id="articles" class="articles-section">
<h2>最新文章</h2>
<div class="article-grid">
<article class="article-card" data-category="python">
<h3><a href="/python-basics">Python基礎教程</a></h3>
<p class="excerpt">學習Python編程的基礎知識...</p>
<div class="meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-15</span>
<span class="tags">
<span class="tag">Python</span>
<span class="tag">編程</span>
</span>
</div>
</article>
<article class="article-card" data-category="web">
<h3><a href="/web-scraping">網絡爬蟲實戰</a></h3>
<p class="excerpt">使用Python進行網絡數據採集...</p>
<div class="meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-10</span>
<span class="tags">
<span class="tag">爬蟲</span>
<span class="tag">數據採集</span>
</span>
</div>
</article>
<article class="article-card" data-category="data">
<h3><a href="/data-analysis">數據分析入門</a></h3>
<p class="excerpt">掌握數據分析的基本方法...</p>
<div class="meta">
<span class="author">作者: 王五</span>
<span class="date">2024-01-05</span>
<span class="tags">
<span class="tag">數據分析</span>
<span class="tag">統計</span>
</span>
</div>
</article>
</div>
</section>
<aside class="sidebar">
<div class="widget">
<h4>熱門標籤</h4>
<div class="tag-cloud">
<a href="#" class="tag-link" data-count="15">Python</a>
<a href="#" class="tag-link" data-count="12">JavaScript</a>
<a href="#" class="tag-link" data-count="8">數據科學</a>
<a href="#" class="tag-link" data-count="6">機器學習</a>
</div>
</div>
<div class="widget">
<h4>友情鏈接</h4>
<ul class="link-list">
<li><a href="https://python.org" target="_blank" rel="noopener">Python官網</a></li>
<li><a href="https://github.com" target="_blank" rel="noopener">GitHub</a></li>
<li><a href="https://stackoverflow.com" target="_blank" rel="noopener">Stack Overflow</a></li>
</ul>
</div>
</aside>
</main>
<footer>
<div class="footer-content">
<p>© 2024 我的網站. 保留所有權利.</p>
<div class="social-links">
<a href="#" class="social-link" data-platform="twitter">Twitter</a>
<a href="#" class="social-link" data-platform="github">GitHub</a>
<a href="#" class="social-link" data-platform="linkedin">LinkedIn</a>
</div>
</div>
</footer>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("✓ 使用本地HTML示例")
# 1. 基本查找方法
print("\n1. 基本查找方法:")
# find() - 查找第一個匹配的元素
first_h1 = soup.find('h1')
print(f"第一個h1標籤: {first_h1}")
# find_all() - 查找所有匹配的元素
all_links = soup.find_all('a')
print(f"所有鏈接數量: {len(all_links)}")
# 限制查找數量
first_3_links = soup.find_all('a', limit=3)
print(f"前3個鏈接: {[link.get_text() for link in first_3_links]}")
# 2. 按屬性查找
print("\n2. 按屬性查找:")
# 按class查找
article_cards = soup.find_all('article', class_='article-card')
print(f"文章卡片數量: {len(article_cards)}")
# 按id查找
hero_section = soup.find('section', id='hero')
if hero_section:
print(f"英雄區域標題: {hero_section.find('h1').get_text()}")
# 按多個class查找
btn_primary = soup.find('button', class_=['btn', 'btn-primary'])
if btn_primary:
print(f"主要按鈕: {btn_primary.get_text()}")
# 按自定義屬性查找
python_articles = soup.find_all('article', {'data-category': 'python'})
print(f"Python分類文章: {len(python_articles)}")
# 3. 使用正則表達式查找
print("\n3. 使用正則表達式查找:")
# 查找href包含特定模式的鏈接
external_links = soup.find_all('a', href=re.compile(r'https?://'))
print(f"外部鏈接數量: {len(external_links)}")
for link in external_links:
print(f" {link.get_text()}: {link.get('href')}")
# 查找class名包含特定模式的元素
tag_elements = soup.find_all(class_=re.compile(r'tag'))
print(f"\n包含'tag'的class元素: {len(tag_elements)}")
# 4. 使用函數查找
print("\n4. 使用函數查找:")
def has_data_attribute(tag):
"""檢查標籤是否有data-*屬性"""
return tag.has_attr('data-category') or tag.has_attr('data-action') or tag.has_attr('data-platform')
data_elements = soup.find_all(has_data_attribute)
print(f"有data屬性的元素: {len(data_elements)}")
for elem in data_elements:
data_attrs = {k: v for k, v in elem.attrs.items() if k.startswith('data-')}
print(f" {elem.name}: {data_attrs}")
# 查找包含特定文本的元素
def contains_python(tag):
"""檢查標籤文本是否包含'Python'"""
return tag.string and 'Python' in tag.string
python_texts = soup.find_all(string=contains_python)
print(f"\n包含'Python'的文本: {python_texts}")
# 5. 層級查找
print("\n5. 層級查找:")
# 查找直接子元素
main_section = soup.find('main')
if main_section:
direct_children = main_section.find_all(recursive=False)
print(f"main的直接子元素: {[child.name for child in direct_children if child.name]}")
# 查找後代元素
nav_links = soup.find('nav').find_all('a') if soup.find('nav') else []
print(f"導航鏈接: {[link.get_text() for link in nav_links]}")
# 6. 兄弟元素查找
print("\n6. 兄弟元素查找:")
# 查找下一個兄弟元素
first_article = soup.find('article')
if first_article:
next_article = first_article.find_next_sibling('article')
if next_article:
next_title = next_article.find('h3').get_text()
print(f"下一篇文章: {next_title}")
# 查找所有後續兄弟元素
all_next_articles = first_article.find_next_siblings('article') if first_article else []
print(f"後續文章數量: {len(all_next_articles)}")
# 7. 父元素查找
print("\n7. 父元素查找:")
# 查找特定鏈接的父元素
python_link = soup.find('a', string='Python基礎教程')
if python_link:
article_parent = python_link.find_parent('article')
if article_parent:
category = article_parent.get('data-category')
print(f"Python教程文章分類: {category}")
# 查找所有祖先元素
if python_link:
parents = [parent.name for parent in python_link.find_parents() if parent.name]
print(f"Python鏈接的祖先元素: {parents}")
# 8. 複雜查找組合
print("\n8. 複雜查找組合:")
# 查找包含特定文本的鏈接
tutorial_links = soup.find_all('a', string=re.compile(r'教程|實戰|入門'))
print(f"教程相關鏈接: {[link.get_text() for link in tutorial_links]}")
# 查找特定結構的元素
articles_with_tags = []
for article in soup.find_all('article'):
tags_container = article.find('span', class_='tags')
if tags_container:
tags = [tag.get_text() for tag in tags_container.find_all('span', class_='tag')]
title = article.find('h3').get_text() if article.find('h3') else 'Unknown'
articles_with_tags.append({'title': title, 'tags': tags})
print(f"\n文章標籤信息:")
for article_info in articles_with_tags:
print(f" {article_info['title']}: {article_info['tags']}")
# 9. 性能優化技巧
print("\n9. 性能優化技巧:")
import time
# 比較不同查找方法的性能
test_iterations = 1000
# 方法1: 使用find_all
start_time = time.time()
for _ in range(test_iterations):
soup.find_all('a')
method1_time = time.time() - start_time
# 方法2: 使用CSS選擇器
start_time = time.time()
for _ in range(test_iterations):
soup.select('a')
method2_time = time.time() - start_time
print(f"性能比較 ({test_iterations}次查找):")
print(f" find_all方法: {method1_time:.4f}秒")
print(f" CSS選擇器: {method2_time:.4f}秒")
# 10. 錯誤處理和邊界情況
print("\n10. 錯誤處理和邊界情況:")
# 處理不存在的元素
non_existent = soup.find('nonexistent')
print(f"不存在的元素: {non_existent}")
# 安全獲取屬性
safe_href = soup.find('a').get('href', '默認值') if soup.find('a') else '無鏈接'
print(f"安全獲取href: {safe_href}")
# 處理空文本
empty_elements = soup.find_all(string=lambda text: text and text.strip() == '')
print(f"空文本元素數量: {len(empty_elements)}")
# 檢查元素是否存在再操作
meta_description = soup.find('meta', attrs={'name': 'description'})
if meta_description:
description_content = meta_description.get('content')
print(f"頁面描述: {description_content}")
else:
print("未找到頁面描述")
# 運行HTML解析演示
if __name__ == "__main__":
html_parsing_demo()
終端日誌:
=== HTML解析功能演示 ===
✓ 使用本地HTML示例
1. 基本查找方法:
第一個h1標籤: <h1>歡迎來到我的網站</h1>
所有鏈接數量: 9
前3個鏈接: ['首頁', '關於', '聯繫']
2. 按屬性查找:
文章卡片數量: 3
英雄區域標題: 歡迎來到我的網站
主要按鈕: 訂閱更新
Python分類文章: 1
3. 使用正則表達式查找:
外部鏈接數量: 3
Python官網: https://python.org
GitHub: https://github.com
Stack Overflow: https://stackoverflow.com
包含'tag'的class元素: 10
4. 使用函數查找:
有data屬性的元素: 7
button: {'data-action': 'subscribe'}
article: {'data-category': 'python'}
article: {'data-category': 'web'}
article: {'data-category': 'data'}
a: {'data-platform': 'twitter'}
a: {'data-platform': 'github'}
a: {'data-platform': 'linkedin'}
包含'Python'的文本: ['Python', 'Python基礎教程']
5. 層級查找:
main的直接子元素: ['section', 'section', 'aside']
導航鏈接: ['首頁', '關於', '聯繫']
6. 兄弟元素查找:
下一篇文章: 網絡爬蟲實戰
後續文章數量: 2
7. 父元素查找:
Python教程文章分類: python
Python鏈接的祖先元素: ['h3', 'article', 'div', 'section', 'main', 'body', 'html', '[document]']
8. 複雜查找組合:
教程相關鏈接: ['Python基礎教程', '數據分析入門']
文章標籤信息:
Python基礎教程: ['Python', '編程']
網絡爬蟲實戰: ['爬蟲', '數據採集']
數據分析入門: ['數據分析', '統計']
9. 性能優化技巧:
性能比較 (1000次查找):
find_all方法: 0.0234秒
CSS選擇器: 0.0189秒
10. 錯誤處理和邊界情況:
不存在的元素: None
安全獲取href: #home
空文本元素數量: 0
頁面描述: 這是一個HTML解析示例頁面
CSS選擇器¶
BeautifulSoup支持CSS選擇器,提供了更靈活的元素選擇方式。
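在進入下面的完整演示之前,先看一個最小示例(其中的HTML片段爲虛構內容,僅用於說明),對比select()與select_one()的基本用法:
from bs4 import BeautifulSoup

# 一個極簡的示例片段(虛構內容,僅用於說明select的用法)
html = '<div id="box"><p class="intro">你好</p><p class="intro">世界</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select()返回所有匹配元素組成的列表
print([p.get_text() for p in soup.select('div#box p.intro')])  # ['你好', '世界']

# select_one()只返回第一個匹配的元素(沒有匹配時返回None)
first = soup.select_one('p.intro')
print(first.get_text() if first else '未找到')  # 你好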
def css_selector_demo():
"""
演示CSS選擇器功能
"""
print("=== CSS選擇器功能演示 ===")
# 示例HTML
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>CSS選擇器示例</title>
</head>
<body>
<div id="container" class="main-content">
<header class="site-header">
<h1 class="site-title">我的博客</h1>
<nav class="main-nav">
<ul>
<li class="nav-item active"><a href="/">首頁</a></li>
<li class="nav-item"><a href="/about">關於</a></li>
<li class="nav-item"><a href="/contact">聯繫</a></li>
</ul>
</nav>
</header>
<main class="content">
<article class="post featured" data-category="tech">
<h2 class="post-title">Python爬蟲技術詳解</h2>
<div class="post-meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-15</span>
<div class="tags">
<span class="tag python">Python</span>
<span class="tag web-scraping">爬蟲</span>
</div>
</div>
<div class="post-content">
<p>這是一篇關於Python爬蟲的詳細教程...</p>
<ul class="feature-list">
<li>基礎概念介紹</li>
<li>實戰案例分析</li>
<li>最佳實踐分享</li>
</ul>
</div>
</article>
<article class="post" data-category="tutorial">
<h2 class="post-title">Web開發入門指南</h2>
<div class="post-meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-10</span>
<div class="tags">
<span class="tag html">HTML</span>
<span class="tag css">CSS</span>
<span class="tag javascript">JavaScript</span>
</div>
</div>
<div class="post-content">
<p>學習Web開發的完整路徑...</p>
<ol class="step-list">
<li>HTML基礎</li>
<li>CSS樣式</li>
<li>JavaScript交互</li>
</ol>
</div>
</article>
</main>
<aside class="sidebar">
<div class="widget recent-posts">
<h3 class="widget-title">最新文章</h3>
<ul class="post-list">
<li><a href="/post1">文章標題1</a></li>
<li><a href="/post2">文章標題2</a></li>
<li><a href="/post3">文章標題3</a></li>
</ul>
</div>
<div class="widget categories">
<h3 class="widget-title">分類</h3>
<ul class="category-list">
<li><a href="/category/tech" data-count="5">技術 (5)</a></li>
<li><a href="/category/tutorial" data-count="3">教程 (3)</a></li>
<li><a href="/category/news" data-count="2">新聞 (2)</a></li>
</ul>
</div>
</aside>
</div>
<footer class="site-footer">
<div class="footer-content">
<p>© 2024 我的博客. 版權所有.</p>
<div class="social-links">
<a href="#" class="social twitter" title="Twitter">Twitter</a>
<a href="#" class="social github" title="GitHub">GitHub</a>
<a href="#" class="social linkedin" title="LinkedIn">LinkedIn</a>
</div>
</div>
</footer>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
# 1. 基本選擇器
print("\n1. 基本選擇器:")
# 標籤選擇器
h1_tags = soup.select('h1')
print(f"h1標籤: {[h1.get_text() for h1 in h1_tags]}")
# 類選擇器
post_titles = soup.select('.post-title')
print(f"文章標題: {[title.get_text() for title in post_titles]}")
# ID選擇器
container = soup.select('#container')
print(f"容器元素: {len(container)}個")
# 屬性選擇器
tech_posts = soup.select('[data-category="tech"]')
print(f"技術分類文章: {len(tech_posts)}個")
# 2. 組合選擇器
print("\n2. 組合選擇器:")
# 後代選擇器
nav_links = soup.select('nav a')
print(f"導航鏈接: {[link.get_text() for link in nav_links]}")
# 子選擇器
direct_children = soup.select('main > article')
print(f"main的直接子文章: {len(direct_children)}個")
# 相鄰兄弟選擇器
next_siblings = soup.select('h2 + .post-meta')
print(f"h2後的meta信息: {len(next_siblings)}個")
# 通用兄弟選擇器
all_siblings = soup.select('h2 ~ div')
print(f"h2後的所有div: {len(all_siblings)}個")
# 3. 僞類選擇器
print("\n3. 僞類選擇器:")
# 第一個子元素
first_children = soup.select('ul li:first-child')
print(f"列表第一項: {[li.get_text() for li in first_children]}")
# 最後一個子元素
last_children = soup.select('ul li:last-child')
print(f"列表最後一項: {[li.get_text() for li in last_children]}")
# 第n個子元素
second_items = soup.select('ul li:nth-child(2)')
print(f"列表第二項: {[li.get_text() for li in second_items]}")
# 奇數/偶數子元素
odd_items = soup.select('ul li:nth-child(odd)')
print(f"奇數位置項目: {len(odd_items)}個")
# 4. 屬性選擇器高級用法
print("\n4. 屬性選擇器高級用法:")
# 包含特定屬性
has_title = soup.select('[title]')
print(f"有title屬性的元素: {len(has_title)}個")
# 屬性值開頭匹配
href_starts = soup.select('a[href^="/category"]')
print(f"href以/category開頭的鏈接: {len(href_starts)}個")
# 屬性值結尾匹配
href_ends = soup.select('a[href$=".html"]')
print(f"href以.html結尾的鏈接: {len(href_ends)}個")
# 屬性值包含匹配
href_contains = soup.select('a[href*="post"]')
print(f"href包含post的鏈接: {len(href_contains)}個")
# 屬性值單詞匹配
class_word = soup.select('[class~="post"]')
print(f"class包含post單詞的元素: {len(class_word)}個")
# 5. 多重選擇器
print("\n5. 多重選擇器:")
# 並集選擇器
headings = soup.select('h1, h2, h3')
print(f"所有標題: {[h.get_text() for h in headings]}")
# 複雜組合
featured_tags = soup.select('article.featured .tag')
print(f"特色文章標籤: {[tag.get_text() for tag in featured_tags]}")
# 6. 否定選擇器
print("\n6. 否定選擇器:")
# 不包含特定class的元素
non_featured = soup.select('article:not(.featured)')
print(f"非特色文章: {len(non_featured)}個")
# 不是第一個子元素
not_first = soup.select('li:not(:first-child)')
print(f"非第一個li元素: {len(not_first)}個")
# 7. 文本內容選擇
print("\n7. 文本內容選擇:")
# 注意:標準CSS選擇器不支持按文本內容選擇
# bs4的select()底層soupsieve提供了非標準的:-soup-contains()僞類,但更常用的做法是配合find_all(string=...)
# 查找包含特定文本的元素
python_elements = soup.find_all(string=re.compile('Python'))
print(f"包含Python的文本: {len(python_elements)}個")
# 8. 性能比較
print("\n8. 性能比較:")
import time
test_iterations = 1000
# CSS選擇器
start_time = time.time()
for _ in range(test_iterations):
soup.select('.post-title')
css_time = time.time() - start_time
# find_all方法
start_time = time.time()
for _ in range(test_iterations):
soup.find_all(class_='post-title')
find_time = time.time() - start_time
print(f"性能測試 ({test_iterations}次):")
print(f" CSS選擇器: {css_time:.4f}秒")
print(f" find_all方法: {find_time:.4f}秒")
# 9. 實用選擇器示例
print("\n9. 實用選擇器示例:")
# 選擇所有外部鏈接
external_links = soup.select('a[href^="http"]')
print(f"外部鏈接: {len(external_links)}個")
# 選擇所有圖片
images = soup.select('img')
print(f"圖片: {len(images)}個")
# 選擇表單元素
form_elements = soup.select('input, textarea, select')
print(f"表單元素: {len(form_elements)}個")
# 選擇有特定數據屬性的元素
data_elements = soup.select('[data-count]')
print(f"有data-count屬性的元素: {len(data_elements)}個")
for elem in data_elements:
print(f" {elem.get_text()}: {elem.get('data-count')}")
# 10. 複雜查詢示例
print("\n10. 複雜查詢示例:")
# 查找特定結構的數據
articles_info = []
for article in soup.select('article'):
title = article.select_one('.post-title')
author = article.select_one('.author')
date = article.select_one('.date')
tags = article.select('.tag')
if title:
article_data = {
'title': title.get_text(),
'author': author.get_text() if author else 'Unknown',
'date': date.get_text() if date else 'Unknown',
'tags': [tag.get_text() for tag in tags],
'category': article.get('data-category', 'Unknown')
}
articles_info.append(article_data)
print("文章詳細信息:")
for info in articles_info:
print(f" 標題: {info['title']}")
print(f" 作者: {info['author']}")
print(f" 日期: {info['date']}")
print(f" 分類: {info['category']}")
print(f" 標籤: {', '.join(info['tags'])}")
print()
# 運行CSS選擇器演示
if __name__ == "__main__":
css_selector_demo()
終端日誌:
=== CSS選擇器功能演示 ===
1. 基本選擇器:
h1標籤: ['我的博客']
文章標題: ['Python爬蟲技術詳解', 'Web開發入門指南']
容器元素: 1個
技術分類文章: 1個
2. 組合選擇器:
導航鏈接: ['首頁', '關於', '聯繫']
main的直接子文章: 2個
h2後的meta信息: 2個
h2後的所有div: 4個
3. 僞類選擇器:
列表第一項: ['首頁', '基礎概念介紹', '文章標題1', '技術 (5)']
列表最後一項: ['聯繫', '最佳實踐分享', '文章標題3', '新聞 (2)']
列表第二項: ['關於', '實戰案例分析', '文章標題2', '教程 (3)']
奇數位置項目: 8個
4. 屬性選擇器高級用法:
有title屬性的元素: 3個
href以/category開頭的鏈接: 3個
href以.html結尾的鏈接: 0個
href包含post的鏈接: 3個
class包含post單詞的元素: 2個
5. 多重選擇器:
所有標題: ['我的博客', 'Python爬蟲技術詳解', 'Web開發入門指南', '最新文章', '分類']
特色文章標籤: ['Python', '爬蟲']
6. 否定選擇器:
非特色文章: 1個
非第一個li元素: 10個
7. 文本內容選擇:
包含Python的文本: 3個
8. 性能比較:
性能測試 (1000次):
CSS選擇器: 0.0156秒
find_all方法: 0.0189秒
9. 實用選擇器示例:
外部鏈接: 0個
圖片: 0個
表單元素: 0個
有data-count屬性的元素: 3個
技術 (5): 5
教程 (3): 3
新聞 (2): 2
10. 複雜查詢示例:
文章詳細信息:
標題: Python爬蟲技術詳解
作者: 作者: 張三
日期: 2024-01-15
分類: tech
標籤: Python, 爬蟲
標題: Web開發入門指南
作者: 作者: 李四
日期: 2024-01-10
分類: tutorial
標籤: HTML, CSS, JavaScript
數據提取¶
BeautifulSoup提供了多種方法來提取HTML元素中的數據。
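下面先用一個虛構的最小示例,對比get_text()、下標取屬性、get()和attrs這幾種最常用的取值方式,完整演示緊隨其後:
from bs4 import BeautifulSoup

# 虛構的最小示例,演示三種常見的取值方式
html = '<a href="/post/1" data-id="1" class="link">第一篇文章</a>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

print(a.get_text())          # 標籤內的文本: 第一篇文章
print(a['href'])             # 下標方式取屬性,屬性不存在時會拋KeyError
print(a.get('title', '無'))  # get()方式取屬性,可以提供默認值
print(a.attrs)               # 所有屬性組成的字典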
def data_extraction_demo():
"""
演示數據提取功能
"""
print("=== 數據提取功能演示 ===")
# 示例HTML - 電商產品頁面
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>商品詳情 - Python編程書籍</title>
<meta name="description" content="Python從入門到精通,適合初學者的編程教程">
<meta name="keywords" content="Python, 編程, 教程, 書籍">
<meta name="price" content="89.00">
</head>
<body>
<div class="product-page">
<header class="page-header">
<nav class="breadcrumb">
<a href="/">首頁</a> >
<a href="/books">圖書</a> >
<a href="/books/programming">編程</a> >
<span class="current">Python從入門到精通</span>
</nav>
</header>
<main class="product-main">
<div class="product-gallery">
<img src="/images/python-book-cover.jpg" alt="Python從入門到精通封面" class="main-image">
<div class="thumbnail-list">
<img src="/images/python-book-thumb1.jpg" alt="縮略圖1" class="thumbnail">
<img src="/images/python-book-thumb2.jpg" alt="縮略圖2" class="thumbnail">
<img src="/images/python-book-thumb3.jpg" alt="縮略圖3" class="thumbnail">
</div>
</div>
<div class="product-info">
<h1 class="product-title">Python從入門到精通(第3版)</h1>
<div class="product-subtitle">零基礎學Python,包含大量實戰案例</div>
<div class="rating-section">
<div class="stars" data-rating="4.5">
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star filled">★</span>
<span class="star half">☆</span>
</div>
<span class="rating-text">4.5分</span>
<a href="#reviews" class="review-count">(1,234條評價)</a>
</div>
<div class="price-section">
<span class="current-price" data-price="89.00">¥89.00</span>
<span class="original-price" data-original="128.00">¥128.00</span>
<span class="discount">7折</span>
<div class="price-note">包郵 | 30天無理由退換</div>
</div>
<div class="product-specs">
<table class="specs-table">
<tr>
<td class="spec-name">作者</td>
<td class="spec-value">張三, 李四</td>
</tr>
<tr>
<td class="spec-name">出版社</td>
<td class="spec-value">人民郵電出版社</td>
</tr>
<tr>
<td class="spec-name">出版時間</td>
<td class="spec-value">2024年1月</td>
</tr>
<tr>
<td class="spec-name">頁數</td>
<td class="spec-value">568頁</td>
</tr>
<tr>
<td class="spec-name">ISBN</td>
<td class="spec-value">978-7-115-12345-6</td>
</tr>
<tr>
<td class="spec-name">重量</td>
<td class="spec-value">0.8kg</td>
</tr>
</table>
</div>
<div class="action-buttons">
<button class="btn btn-primary add-to-cart" data-product-id="12345">加入購物車</button>
<button class="btn btn-secondary buy-now" data-product-id="12345">立即購買</button>
<button class="btn btn-outline favorite" data-product-id="12345">收藏</button>
</div>
</div>
</main>
<section class="product-details">
<div class="tabs">
<div class="tab active" data-tab="description">商品描述</div>
<div class="tab" data-tab="contents">目錄</div>
<div class="tab" data-tab="reviews">用戶評價</div>
</div>
<div class="tab-content active" id="description">
<div class="description-text">
<p>本書是Python編程的入門經典教程,適合零基礎讀者學習。</p>
<p>全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。</p>
<ul class="feature-list">
<li>✓ 零基礎入門,循序漸進</li>
<li>✓ 大量實戰案例,學以致用</li>
<li>✓ 配套視頻教程,立體學習</li>
<li>✓ 技術社區支持,答疑解惑</li>
</ul>
</div>
</div>
<div class="tab-content" id="contents">
<div class="contents-list">
<div class="chapter">
<h3>第1章 Python基礎</h3>
<ul>
<li>1.1 Python簡介</li>
<li>1.2 開發環境搭建</li>
<li>1.3 第一個Python程序</li>
</ul>
</div>
<div class="chapter">
<h3>第2章 數據類型</h3>
<ul>
<li>2.1 數字類型</li>
<li>2.2 字符串</li>
<li>2.3 列表和元組</li>
</ul>
</div>
<!-- 更多章節... -->
</div>
</div>
<div class="tab-content" id="reviews">
<div class="reviews-summary">
<div class="rating-breakdown">
<div class="rating-bar">
<span class="stars">5星</span>
<div class="bar"><div class="fill" style="width: 60%"></div></div>
<span class="count">740</span>
</div>
<div class="rating-bar">
<span class="stars">4星</span>
<div class="bar"><div class="fill" style="width: 25%"></div></div>
<span class="count">309</span>
</div>
<div class="rating-bar">
<span class="stars">3星</span>
<div class="bar"><div class="fill" style="width: 10%"></div></div>
<span class="count">123</span>
</div>
<div class="rating-bar">
<span class="stars">2星</span>
<div class="bar"><div class="fill" style="width: 3%"></div></div>
<span class="count">37</span>
</div>
<div class="rating-bar">
<span class="stars">1星</span>
<div class="bar"><div class="fill" style="width: 2%"></div></div>
<span class="count">25</span>
</div>
</div>
</div>
<div class="reviews-list">
<div class="review" data-rating="5">
<div class="review-header">
<span class="reviewer">Python學習者</span>
<div class="review-stars">★★★★★</div>
<span class="review-date">2024-01-15</span>
</div>
<div class="review-content">
<p>非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="23">有用 (23)</button>
</div>
</div>
<div class="review" data-rating="4">
<div class="review-header">
<span class="reviewer">編程新手</span>
<div class="review-stars">★★★★☆</div>
<span class="review-date">2024-01-10</span>
</div>
<div class="review-content">
<p>書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="15">有用 (15)</button>
</div>
</div>
<div class="review" data-rating="5">
<div class="review-header">
<span class="reviewer">技術愛好者</span>
<div class="review-stars">★★★★★</div>
<span class="review-date">2024-01-08</span>
</div>
<div class="review-content">
<p>推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。</p>
</div>
<div class="review-helpful">
<button class="helpful-btn" data-count="31">有用 (31)</button>
</div>
</div>
</div>
</div>
</section>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
# 1. 基本文本提取
print("\n1. 基本文本提取:")
# 提取標題
title = soup.find('h1', class_='product-title')
print(f"商品標題: {title.get_text() if title else 'N/A'}")
# 提取副標題
subtitle = soup.find('div', class_='product-subtitle')
print(f"商品副標題: {subtitle.get_text() if subtitle else 'N/A'}")
# 提取價格信息
current_price = soup.find('span', class_='current-price')
original_price = soup.find('span', class_='original-price')
discount = soup.find('span', class_='discount')
print(f"當前價格: {current_price.get_text() if current_price else 'N/A'}")
print(f"原價: {original_price.get_text() if original_price else 'N/A'}")
print(f"折扣: {discount.get_text() if discount else 'N/A'}")
# 2. 屬性值提取
print("\n2. 屬性值提取:")
# 提取數據屬性
rating_element = soup.find('div', class_='stars')
if rating_element:
rating = rating_element.get('data-rating')
print(f"評分: {rating}")
# 提取價格數據屬性
if current_price:
price_value = current_price.get('data-price')
print(f"價格數值: {price_value}")
# 提取產品ID
add_to_cart_btn = soup.find('button', class_='add-to-cart')
if add_to_cart_btn:
product_id = add_to_cart_btn.get('data-product-id')
print(f"產品ID: {product_id}")
# 提取圖片信息
main_image = soup.find('img', class_='main-image')
if main_image:
img_src = main_image.get('src')
img_alt = main_image.get('alt')
print(f"主圖片: {img_src}, 描述: {img_alt}")
# 3. 表格數據提取
print("\n3. 表格數據提取:")
specs_table = soup.find('table', class_='specs-table')
if specs_table:
specs = {}
rows = specs_table.find_all('tr')
for row in rows:
name_cell = row.find('td', class_='spec-name')
value_cell = row.find('td', class_='spec-value')
if name_cell and value_cell:
specs[name_cell.get_text()] = value_cell.get_text()
print("商品規格:")
for key, value in specs.items():
print(f" {key}: {value}")
# 4. 列表數據提取
print("\n4. 列表數據提取:")
# 提取麪包屑導航
breadcrumb = soup.find('nav', class_='breadcrumb')
if breadcrumb:
links = breadcrumb.find_all('a')
current = breadcrumb.find('span', class_='current')
breadcrumb_path = [link.get_text() for link in links]
if current:
breadcrumb_path.append(current.get_text())
print(f"導航路徑: {' > '.join(breadcrumb_path)}")
# 提取特性列表
feature_list = soup.find('ul', class_='feature-list')
if feature_list:
features = [li.get_text().strip() for li in feature_list.find_all('li')]
print(f"產品特性: {features}")
# 5. 複雜結構數據提取
print("\n5. 複雜結構數據提取:")
# 提取評價信息
reviews = []
review_elements = soup.find_all('div', class_='review')
for review_elem in review_elements:
reviewer = review_elem.find('span', class_='reviewer')
rating_stars = review_elem.find('div', class_='review-stars')
date = review_elem.find('span', class_='review-date')
content = review_elem.find('div', class_='review-content')
helpful_btn = review_elem.find('button', class_='helpful-btn')
review_data = {
'reviewer': reviewer.get_text() if reviewer else 'Anonymous',
'rating': review_elem.get('data-rating') if review_elem.has_attr('data-rating') else 'N/A',
'date': date.get_text() if date else 'N/A',
'content': content.get_text().strip() if content else 'N/A',
'helpful_count': helpful_btn.get('data-count') if helpful_btn else '0'
}
reviews.append(review_data)
print(f"用戶評價 ({len(reviews)}條):")
for i, review in enumerate(reviews, 1):
print(f" 評價{i}:")
print(f" 用戶: {review['reviewer']}")
print(f" 評分: {review['rating']}星")
print(f" 日期: {review['date']}")
print(f" 內容: {review['content'][:50]}...")
print(f" 有用數: {review['helpful_count']}")
print()
# 6. 評分統計提取
print("\n6. 評分統計提取:")
rating_bars = soup.find_all('div', class_='rating-bar')
rating_stats = {}
for bar in rating_bars:
stars = bar.find('span', class_='stars')
count = bar.find('span', class_='count')
fill_elem = bar.find('div', class_='fill')
if stars and count:
star_level = stars.get_text()
count_num = count.get_text()
percentage = '0%'
if fill_elem and fill_elem.has_attr('style'):
style = fill_elem.get('style')
# 提取width百分比
import re
width_match = re.search(r'width:\s*(\d+%)', style)
if width_match:
percentage = width_match.group(1)
rating_stats[star_level] = {
'count': count_num,
'percentage': percentage
}
print("評分分佈:")
for star_level, stats in rating_stats.items():
print(f" {star_level}: {stats['count']}條 ({stats['percentage']})")
# 7. 文本清理和格式化
print("\n7. 文本清理和格式化:")
# 提取並清理描述文本
description = soup.find('div', class_='description-text')
if description:
# 獲取純文本,去除HTML標籤
clean_text = description.get_text(separator=' ', strip=True)
print(f"商品描述: {clean_text[:100]}...")
# 提取段落
paragraphs = [p.get_text().strip() for p in description.find_all('p')]
print(f"描述段落數: {len(paragraphs)}")
# 8. 條件提取
print("\n8. 條件提取:")
# 提取高評分評價
high_rating_reviews = soup.find_all('div', class_='review', attrs={'data-rating': lambda x: x and int(x) >= 4})
print(f"高評分評價數量: {len(high_rating_reviews)}")
# 提取有用評價(有用數>20)
useful_reviews = []
for review in soup.find_all('div', class_='review'):
helpful_btn = review.find('button', class_='helpful-btn')
if helpful_btn:
count = helpful_btn.get('data-count')
if count and int(count) > 20:
reviewer = review.find('span', class_='reviewer')
useful_reviews.append(reviewer.get_text() if reviewer else 'Anonymous')
print(f"有用評價用戶: {useful_reviews}")
# 9. 數據驗證和錯誤處理
print("\n9. 數據驗證和錯誤處理:")
# 安全提取價格
def safe_extract_price(element):
if not element:
return None
price_text = element.get_text().strip()
# 提取數字
import re
price_match = re.search(r'([\d.]+)', price_text)
if price_match:
try:
return float(price_match.group(1))
except ValueError:
return None
return None
current_price_value = safe_extract_price(current_price)
original_price_value = safe_extract_price(original_price)
print(f"當前價格數值: {current_price_value}")
print(f"原價數值: {original_price_value}")
if current_price_value and original_price_value:
savings = original_price_value - current_price_value
discount_percent = (savings / original_price_value) * 100
print(f"節省金額: ¥{savings:.2f}")
print(f"折扣百分比: {discount_percent:.1f}%")
# 10. 綜合數據結構
print("\n10. 綜合數據結構:")
# 構建完整的產品數據結構
product_data = {
'basic_info': {
'title': title.get_text() if title else None,
'subtitle': subtitle.get_text() if subtitle else None,
'product_id': product_id if 'product_id' in locals() else None
},
'pricing': {
'current_price': current_price_value,
'original_price': original_price_value,
'discount_text': discount.get_text() if discount else None
},
'rating': {
'score': rating if 'rating' in locals() else None,
'total_reviews': len(reviews),
'rating_distribution': rating_stats
},
'specifications': specs if 'specs' in locals() else {},
'features': features if 'features' in locals() else [],
'reviews_sample': reviews[:2] # 只保留前兩條評價作爲示例
}
print("產品數據結構:")
import json
print(json.dumps(product_data, ensure_ascii=False, indent=2))
# 運行數據提取演示
if __name__ == "__main__":
data_extraction_demo()
終端日誌:
=== 數據提取功能演示 ===
1. 基本文本提取:
商品標題: Python從入門到精通(第3版)
商品副標題: 零基礎學Python,包含大量實戰案例
當前價格: ¥89.00
原價: ¥128.00
折扣: 7折
2. 屬性值提取:
評分: 4.5
價格數值: 89.00
產品ID: 12345
主圖片: /images/python-book-cover.jpg, 描述: Python從入門到精通封面
3. 表格數據提取:
商品規格:
作者: 張三, 李四
出版社: 人民郵電出版社
出版時間: 2024年1月
頁數: 568頁
ISBN: 978-7-115-12345-6
重量: 0.8kg
4. 列表數據提取:
導航路徑: 首頁 > 圖書 > 編程 > Python從入門到精通
產品特性: ['✓ 零基礎入門,循序漸進', '✓ 大量實戰案例,學以致用', '✓ 配套視頻教程,立體學習', '✓ 技術社區支持,答疑解惑']
5. 複雜結構數據提取:
用戶評價 (3條):
評價1:
用戶: Python學習者
評分: 5星
日期: 2024-01-15
內容: 非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。...
有用數: 23
評價2:
用戶: 編程新手
評分: 4星
日期: 2024-01-10
內容: 書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。...
有用數: 15
評價3:
用戶: 技術愛好者
評分: 5星
日期: 2024-01-08
內容: 推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。...
有用數: 31
6. 評分統計提取:
評分分佈:
5星: 740條 (60%)
4星: 309條 (25%)
3星: 123條 (10%)
2星: 37條 (3%)
1星: 25條 (2%)
7. 文本清理和格式化:
商品描述: 本書是Python編程的入門經典教程,適合零基礎讀者學習。 全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。 ✓ 零基礎入門,循序漸進 ✓ 大量實戰案例,學以致用 ✓ 配套視頻教程,立體學習 ✓ 技術社區支持,答疑解惑...
描述段落數: 2
8. 條件提取:
高評分評價數量: 3
有用評價用戶: ['Python學習者', '技術愛好者']
9. 數據驗證和錯誤處理:
當前價格數值: 89.0
原價數值: 128.0
節省金額: ¥39.00
折扣百分比: 30.5%
10. 綜合數據結構:
產品數據結構:
{
"basic_info": {
"title": "Python從入門到精通(第3版)",
"subtitle": "零基礎學Python,包含大量實戰案例",
"product_id": "12345"
},
"pricing": {
"current_price": 89.0,
"original_price": 128.0,
"discount_text": "7折"
},
"rating": {
"score": "4.5",
"total_reviews": 3,
"rating_distribution": {
"5星": {
"count": "740",
"percentage": "60%"
},
"4星": {
"count": "309",
"percentage": "25%"
},
"3星": {
"count": "123",
"percentage": "10%"
},
"2星": {
"count": "37",
"percentage": "3%"
},
"1星": {
"count": "25",
"percentage": "2%"
}
}
},
"specifications": {
"作者": "張三, 李四",
"出版社": "人民郵電出版社",
"出版時間": "2024年1月",
"頁數": "568頁",
"ISBN": "978-7-115-12345-6",
"重量": "0.8kg"
},
"features": [
"✓ 零基礎入門,循序漸進",
"✓ 大量實戰案例,學以致用",
"✓ 配套視頻教程,立體學習",
"✓ 技術社區支持,答疑解惑"
],
"reviews_sample": [
{
"reviewer": "Python學習者",
"rating": "5",
"date": "2024-01-15",
"content": "非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。",
"helpful_count": "23"
},
{
"reviewer": "編程新手",
"rating": "4",
"date": "2024-01-10",
"content": "書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。",
"helpful_count": "15"
}
]
}
高級操作¶
文檔修改¶
BeautifulSoup不僅可以解析HTML,還可以修改文檔結構。
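先看一個最小示例(HTML爲虛構內容),演示新建、插入和刪除元素的基本套路;注意new_tag()不接受class_參數,class需要放進attrs字典:
from bs4 import BeautifulSoup

html = '<ul id="menu"><li>首頁</li><li class="old">舊鏈接</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# 新建元素: new_tag()不會把class_轉換成class,class需通過attrs字典傳入
new_li = soup.new_tag('li', attrs={'class': 'new'})
new_li.string = '新頁面'
soup.find('ul').append(new_li)

# 刪除元素: decompose()會把元素從樹中徹底移除
soup.find('li', class_='old').decompose()

print(soup)  # <ul id="menu"><li>首頁</li><li class="new">新頁面</li></ul>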
def document_modification_demo():
"""
演示文檔修改功能
"""
print("=== 文檔修改功能演示 ===")
# 示例HTML - 簡單的博客文章
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>我的博客文章</title>
<meta name="author" content="原作者">
</head>
<body>
<div class="container">
<header>
<h1>Python學習筆記</h1>
<p class="meta">發佈時間: 2024-01-01</p>
</header>
<main class="content">
<section class="intro">
<h2>簡介</h2>
<p>這是一篇關於Python基礎的文章。</p>
</section>
<section class="topics">
<h2>主要內容</h2>
<ul id="topic-list">
<li>變量和數據類型</li>
<li>控制結構</li>
</ul>
</section>
<section class="examples">
<h2>代碼示例</h2>
<div class="code-block">
<pre><code>print("Hello, World!")</code></pre>
</div>
</section>
</main>
<footer>
<p>版權所有 © 2024</p>
</footer>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("\n1. 修改文本內容:")
# 修改標題
title_tag = soup.find('h1')
if title_tag:
old_title = title_tag.get_text()
title_tag.string = "Python高級編程技巧"
print(f"標題修改: '{old_title}' -> '{title_tag.get_text()}'")
# 修改作者信息
author_meta = soup.find('meta', attrs={'name': 'author'})
if author_meta:
old_author = author_meta.get('content')
author_meta['content'] = "技術專家"
print(f"作者修改: '{old_author}' -> '{author_meta.get('content')}'")
# 修改發佈時間
meta_p = soup.find('p', class_='meta')
if meta_p:
old_time = meta_p.get_text()
meta_p.string = "發佈時間: 2024-01-15 (已更新)"
print(f"時間修改: '{old_time}' -> '{meta_p.get_text()}'")
print("\n2. 添加新元素:")
# 在列表中添加新項目
topic_list = soup.find('ul', id='topic-list')
if topic_list:
# 創建新的li元素
new_li1 = soup.new_tag('li')
new_li1.string = "函數和模塊"
new_li2 = soup.new_tag('li')
new_li2.string = "面向對象編程"
new_li3 = soup.new_tag('li')
new_li3.string = "異常處理"
# 添加到列表末尾
topic_list.append(new_li1)
topic_list.append(new_li2)
topic_list.append(new_li3)
print(f"添加了3個新的主題項目")
print(f"當前主題列表: {[li.get_text() for li in topic_list.find_all('li')]}")
# 添加新的代碼示例
examples_section = soup.find('section', class_='examples')
if examples_section:
# 創建新的代碼塊
# 注意: new_tag()不會把class_轉換成class屬性,class需要放在attrs字典中傳入
new_code_block = soup.new_tag('div', attrs={'class': 'code-block'})
new_pre = soup.new_tag('pre')
new_code = soup.new_tag('code')
new_code.string = '''def greet(name):
return f"Hello, {name}!"
print(greet("Python"))'''
new_pre.append(new_code)
new_code_block.append(new_pre)
examples_section.append(new_code_block)
print("添加了新的代碼示例")
# 添加新的section
main_content = soup.find('main', class_='content')
if main_content:
new_section = soup.new_tag('section', attrs={'class': 'resources'})
new_h2 = soup.new_tag('h2')
new_h2.string = "學習資源"
new_ul = soup.new_tag('ul')
resources = [
"Python官方文檔",
"在線編程練習",
"開源項目參與"
]
for resource in resources:
li = soup.new_tag('li')
li.string = resource
new_ul.append(li)
new_section.append(new_h2)
new_section.append(new_ul)
main_content.append(new_section)
print("添加了新的學習資源section")
print("\n3. 修改屬性:")
# 修改容器類名
container = soup.find('div', class_='container')
if container:
old_class = container.get('class')
container['class'] = ['main-container', 'updated']
container['data-version'] = '2.0'
print(f"容器類名修改: {old_class} -> {container.get('class')}")
print(f"添加了data-version屬性: {container.get('data-version')}")
# 爲代碼塊添加語言標識
code_blocks = soup.find_all('div', class_='code-block')
for i, block in enumerate(code_blocks):
block['data-language'] = 'python'
block['data-line-numbers'] = 'true'
print(f"代碼塊{i+1}添加了語言標識和行號屬性")
print("\n4. 刪除元素:")
# 刪除版權信息(示例)
footer = soup.find('footer')
if footer:
copyright_p = footer.find('p')
if copyright_p:
old_text = copyright_p.get_text()
copyright_p.decompose() # 完全刪除元素
print(f"刪除了版權信息: '{old_text}'")
print("\n5. 元素移動和重排:")
# 將簡介section移動到主要內容之後
intro_section = soup.find('section', class_='intro')
topics_section = soup.find('section', class_='topics')
if intro_section and topics_section:
# 從當前位置移除
intro_section.extract()
# 插入到topics_section之後
topics_section.insert_after(intro_section)
print("將簡介section移動到主要內容section之後")
print("\n6. 批量操作:")
# 爲所有h2標籤添加id屬性
h2_tags = soup.find_all('h2')
for h2 in h2_tags:
# 生成id(將標題轉換爲合適的id格式)
title_text = h2.get_text().lower().replace(' ', '-').replace(',', '')
h2['id'] = f"section-{title_text}"
print(f"爲h2標籤添加id: {h2['id']}")
# 爲所有鏈接添加target="_blank"
links = soup.find_all('a')
for link in links:
link['target'] = '_blank'
link['rel'] = 'noopener noreferrer'
if links:
print(f"爲{len(links)}個鏈接添加了target和rel屬性")
else:
print("沒有找到鏈接元素")
print("\n7. 條件修改:")
# 只修改包含特定文本的元素
all_p = soup.find_all('p')
modified_count = 0
for p in all_p:
text = p.get_text()
if 'Python' in text:
# 添加強調樣式
p['class'] = p.get('class', []) + ['python-related']
p['style'] = 'font-weight: bold; color: #3776ab;'
modified_count += 1
print(f"爲{modified_count}個包含'Python'的段落添加了樣式")
print("\n8. 創建複雜結構:")
# 創建一個導航菜單
nav = soup.new_tag('nav', attrs={'class': 'table-of-contents'})
nav_title = soup.new_tag('h3')
nav_title.string = "目錄"
nav_ul = soup.new_tag('ul')
# 基於現有的h2標籤創建導航
for h2 in soup.find_all('h2'):
li = soup.new_tag('li')
a = soup.new_tag('a', href=f"#{h2.get('id', '')}")
a.string = h2.get_text()
li.append(a)
nav_ul.append(li)
nav.append(nav_title)
nav.append(nav_ul)
# 將導航插入到header之後
header = soup.find('header')
if header:
header.insert_after(nav)
print("創建並插入了目錄導航")
print("\n9. 文檔結構優化:")
# 添加語義化標籤
main_tag = soup.find('main')
if main_tag:
# 爲main標籤添加role屬性
main_tag['role'] = 'main'
main_tag['aria-label'] = '主要內容'
print("爲main標籤添加了無障礙屬性")
# 添加meta標籤
head = soup.find('head')
if head:
# 添加viewport meta
viewport_meta = soup.new_tag('meta', attrs={
'name': 'viewport',
'content': 'width=device-width, initial-scale=1.0'
})
# 添加description meta
desc_meta = soup.new_tag('meta', attrs={
'name': 'description',
'content': 'Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。'
})
head.append(viewport_meta)
head.append(desc_meta)
print("添加了viewport和description meta標籤")
print("\n10. 輸出修改後的文檔:")
# 格式化輸出
formatted_html = soup.prettify()
print("修改後的HTML文檔:")
print(formatted_html[:1000] + "..." if len(formatted_html) > 1000 else formatted_html)
# 統計信息
print(f"\n文檔統計:")
print(f" 總標籤數: {len(soup.find_all())}")
print(f" 段落數: {len(soup.find_all('p'))}")
print(f" 標題數: {len(soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']))}")
print(f" 列表項數: {len(soup.find_all('li'))}")
print(f" 代碼塊數: {len(soup.find_all('div', class_='code-block'))}")
return soup
# 運行文檔修改演示
if __name__ == "__main__":
modified_soup = document_modification_demo()
終端日誌:
=== 文檔修改功能演示 ===
1. 修改文本內容:
標題修改: 'Python學習筆記' -> 'Python高級編程技巧'
作者修改: '原作者' -> '技術專家'
時間修改: '發佈時間: 2024-01-01' -> '發佈時間: 2024-01-15 (已更新)'
2. 添加新元素:
添加了3個新的主題項目
當前主題列表: ['變量和數據類型', '控制結構', '函數和模塊', '面向對象編程', '異常處理']
添加了新的代碼示例
添加了新的學習資源section
3. 修改屬性:
容器類名修改: ['container'] -> ['main-container', 'updated']
添加了data-version屬性: 2.0
代碼塊1添加了語言標識和行號屬性
代碼塊2添加了語言標識和行號屬性
4. 刪除元素:
刪除了版權信息: '版權所有 © 2024'
5. 元素移動和重排:
將簡介section移動到主要內容section之後
6. 批量操作:
爲h2標籤添加id: section-主要內容
爲h2標籤添加id: section-簡介
爲h2標籤添加id: section-代碼示例
爲h2標籤添加id: section-學習資源
沒有找到鏈接元素
7. 條件修改:
爲1個包含'Python'的段落添加了樣式
8. 創建複雜結構:
創建並插入了目錄導航
9. 文檔結構優化:
爲main標籤添加了無障礙屬性
添加了viewport和description meta標籤
10. 輸出修改後的文檔:
修改後的HTML文檔:
<!DOCTYPE html>
<html>
<head>
<title>
我的博客文章
</title>
<meta content="技術專家" name="author"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。" name="description"/>
</head>
<body>
<div class="main-container updated" data-version="2.0">
<header>
<h1>
Python高級編程技巧
</h1>
<p class="meta">
發佈時間: 2024-01-15 (已更新)
</p>
</header>
<nav class="table-of-contents">
<h3>
目錄
</h3>
<ul>
<li>
<a href="#section-主要內容">
主要內容
</a>
</li>
<li>
<a href="#section-簡介">
簡介
</a>
</li>
<li>
<a href="#section-代碼示例">
代碼示例
</a>
</li>
<li>
<a href="#section-學習資源">
學習資源
</a>
</li>
</ul>
</nav>
<main aria-label="主要內容" class="content" role="main">
<section class="topics">
<h2 id="section-主要內容">
主要內容
</h2>
<ul id="topic-list">
<li>
變量和數據類型
</li>
<li>
控制結構
</li>
<li>
函數和模塊
</li>
<li>
面向對象編程
</li>
<li>
異常處理
</li>
</ul>
</section>
<section class="intro">
<h2 id="section-簡介">
簡介
</h2>
<p class="python-related" style="font-weight: bold; color: #3776ab;">
這是一篇關於Python基礎的文章。
</p>
</section>
<section class="examples">
<h2 id="section-代碼示例">
代碼示例
</h2>
<div class="code-block" data-language="python" data-line-numbers="true">
<pre><code>print("Hello, World!")</code></pre>
</div>
<div class="code-block" data-language="python" data-line-numbers="true">
<pre><code>def greet(name):
return f"Hello, {name}!"
print(greet("Python"))</code></pre>
</div>
</section>
<section class="resources">
<h2 id="section-學習資源">
學習資源
</h2>
<ul>
<li>
Python官方文檔
</li>
<li>
在線編程練習
</li>
<li>
開源項目參與
</li>
</ul>
</section>
</main>
<footer>
</footer>
</div>
</body>
</html>...
文檔統計:
總標籤數: 32
段落數: 1
標題數: 5
列表項數: 11
代碼塊數: 2
元素插入和刪除¶
BeautifulSoup提供了靈活的元素插入和刪除方法。
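下面的最小示例(HTML爲虛構內容)對比insert_before()/insert_after()、extract()、replace_with()和decompose()的區別,完整演示見後文:
from bs4 import BeautifulSoup

html = '<div><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, 'html.parser')
a, b, c = soup.find(id='a'), soup.find(id='b'), soup.find(id='c')

# insert_before / insert_after: 在兄弟位置插入
new_p = soup.new_tag('p', id='x')
new_p.string = 'X'
b.insert_before(new_p)          # 順序變爲 A X B C

# extract(): 從樹中取出但保留對象,之後還可以重新插入
moved = c.extract()
a.insert_after(moved)           # 順序變爲 A C X B

# replace_with(): 用新元素替換; decompose(): 徹底刪除
b.replace_with(soup.new_tag('hr'))
print(soup)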
def element_operations_demo():
"""
演示元素插入和刪除操作
"""
print("=== 元素插入和刪除操作演示 ===")
# 示例HTML - 文章列表
html_content = """
<!DOCTYPE html>
<html>
<head>
<title>文章管理系統</title>
</head>
<body>
<div class="article-manager">
<header class="page-header">
<h1>文章列表</h1>
<div class="actions">
<button class="btn-new">新建文章</button>
</div>
</header>
<main class="article-list">
<article class="article-item" data-id="1">
<h2 class="article-title">Python基礎教程</h2>
<p class="article-summary">學習Python編程的基礎知識</p>
<div class="article-meta">
<span class="author">作者: 張三</span>
<span class="date">2024-01-01</span>
<span class="category">編程</span>
</div>
<div class="article-actions">
<button class="btn-edit">編輯</button>
<button class="btn-delete">刪除</button>
</div>
</article>
<article class="article-item" data-id="2">
<h2 class="article-title">Web開發入門</h2>
<p class="article-summary">從零開始學習Web開發</p>
<div class="article-meta">
<span class="author">作者: 李四</span>
<span class="date">2024-01-05</span>
<span class="category">Web開發</span>
</div>
<div class="article-actions">
<button class="btn-edit">編輯</button>
<button class="btn-delete">刪除</button>
</div>
</article>
</main>
<footer class="page-footer">
<p>共 2 篇文章</p>
</footer>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
print("\n1. 在指定位置插入元素:")
# 在第一篇文章前插入新文章
article_list = soup.find('main', class_='article-list')
first_article = soup.find('article', class_='article-item')
if article_list and first_article:
# 創建新文章
new_article = soup.new_tag('article', attrs={'class': 'article-item featured', 'data-id': '0'})
# 創建文章標題
title = soup.new_tag('h2', attrs={'class': 'article-title'})
title.string = "🔥 熱門推薦:Python高級特性詳解"
# 創建文章摘要
summary = soup.new_tag('p', attrs={'class': 'article-summary'})
summary.string = "深入瞭解Python的高級特性和最佳實踐"
# 創建元數據
meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})
author_span = soup.new_tag('span', attrs={'class': 'author'})
author_span.string = "作者: 技術專家"
date_span = soup.new_tag('span', attrs={'class': 'date'})
date_span.string = "2024-01-15"
category_span = soup.new_tag('span', attrs={'class': 'category featured-category'})
category_span.string = "高級編程"
meta_div.extend([author_span, date_span, category_span])
# 創建操作按鈕
actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})
edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit'})
edit_btn.string = "編輯"
delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
delete_btn.string = "刪除"
pin_btn = soup.new_tag('button', attrs={'class': 'btn-pin'})
pin_btn.string = "置頂"
actions_div.extend([edit_btn, delete_btn, pin_btn])
# 組裝新文章
new_article.extend([title, summary, meta_div, actions_div])
# 插入到第一篇文章前
first_article.insert_before(new_article)
print("在列表開頭插入了特色文章")
# 在最後一篇文章後插入新文章
all_articles = soup.find_all('article', class_='article-item')
if all_articles:
last_article = all_articles[-1]
# 創建另一篇新文章
another_article = soup.new_tag('article', attrs={'class': 'article-item draft', 'data-id': '3'})
title = soup.new_tag('h2', attrs={'class': 'article-title'})
title.string = "📝 草稿:數據庫設計原理"
summary = soup.new_tag('p', attrs={'class': 'article-summary'})
summary.string = "數據庫設計的基本原理和最佳實踐(草稿狀態)"
meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})
author_span = soup.new_tag('span', attrs={'class': 'author'})
author_span.string = "作者: 王五"
date_span = soup.new_tag('span', attrs={'class': 'date'})
date_span.string = "2024-01-16"
status_span = soup.new_tag('span', attrs={'class': 'status draft-status'})
status_span.string = "草稿"
meta_div.extend([author_span, date_span, status_span])
actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})
edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit primary'})
edit_btn.string = "繼續編輯"
publish_btn = soup.new_tag('button', attrs={'class': 'btn-publish'})
publish_btn.string = "發佈"
delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
delete_btn.string = "刪除"
actions_div.extend([edit_btn, publish_btn, delete_btn])
another_article.extend([title, summary, meta_div, actions_div])
# 插入到最後一篇文章後
last_article.insert_after(another_article)
print("在列表末尾插入了草稿文章")
print("\n2. 在父元素中插入子元素:")
# 在頁面頭部添加搜索框
page_header = soup.find('header', class_='page-header')
if page_header:
# 創建搜索區域
search_div = soup.new_tag('div', attrs={'class': 'search-area'})
search_input = soup.new_tag('input', attrs={'type': 'text', 'placeholder': '搜索文章...', 'class': 'search-input'})
search_btn = soup.new_tag('button', attrs={'class': 'btn-search'})
search_btn.string = "搜索"
search_div.extend([search_input, search_btn])
# 插入到actions div之前
actions_div = page_header.find('div', class_='actions')
if actions_div:
actions_div.insert_before(search_div)
print("在頁面頭部添加了搜索區域")
# 在每篇文章中添加標籤
articles = soup.find_all('article', class_='article-item')
for i, article in enumerate(articles):
meta_div = article.find('div', class_='article-meta')
if meta_div:
# 創建標籤容器
tags_div = soup.new_tag('div', attrs={'class': 'article-tags'})
# 根據文章類型添加不同標籤
if 'featured' in article.get('class', []):
tags = ['熱門', '推薦', 'Python']
elif 'draft' in article.get('class', []):
tags = ['草稿', '數據庫']
else:
tags = ['基礎', '教程']
for tag in tags:
tag_span = soup.new_tag('span', attrs={'class': 'tag'})
tag_span.string = tag
tags_div.append(tag_span)
# 插入到meta div之後
meta_div.insert_after(tags_div)
print(f"爲文章{i+1}添加了標籤")
print("\n3. 刪除元素:")
# 刪除第二篇文章(原來的第一篇)
articles = soup.find_all('article', class_='article-item')
if len(articles) > 1:
article_to_delete = articles[1] # 第二篇文章
article_title = article_to_delete.find('h2', class_='article-title')
title_text = article_title.get_text() if article_title else "未知標題"
article_to_delete.decompose() # 完全刪除
print(f"刪除了文章: '{title_text}'")
# 刪除所有草稿狀態的文章
draft_articles = soup.find_all('article', class_='draft')
deleted_drafts = []
for draft in draft_articles:
title_elem = draft.find('h2', class_='article-title')
if title_elem:
deleted_drafts.append(title_elem.get_text())
draft.decompose()
if deleted_drafts:
print(f"刪除了草稿文章: {deleted_drafts}")
else:
print("沒有找到草稿文章")
# 刪除特定的按鈕
pin_buttons = soup.find_all('button', class_='btn-pin')
for btn in pin_buttons:
btn.decompose()
if pin_buttons:
print(f"刪除了{len(pin_buttons)}個置頂按鈕")
print("\n4. 替換元素:")
# 替換頁面標題
page_title = soup.find('h1')
if page_title:
old_title = page_title.get_text()
# 創建新的標題元素
new_title = soup.new_tag('h1', attrs={'class': 'main-title'})
new_title.string = "📚 技術文章管理中心"
# 替換
page_title.replace_with(new_title)
print(f"頁面標題替換: '{old_title}' -> '{new_title.get_text()}'")
# 替換所有編輯按鈕爲更詳細的按鈕
edit_buttons = soup.find_all('button', class_='btn-edit')
for btn in edit_buttons:
# 創建新的按鈕組
btn_group = soup.new_tag('div', attrs={'class': 'btn-group'})
quick_edit = soup.new_tag('button', attrs={'class': 'btn-quick-edit'})
quick_edit.string = "快速編輯"
full_edit = soup.new_tag('button', attrs={'class': 'btn-full-edit'})
full_edit.string = "完整編輯"
btn_group.extend([quick_edit, full_edit])
# 替換原按鈕
btn.replace_with(btn_group)
print(f"替換了{len(edit_buttons)}個編輯按鈕爲按鈕組")
print("\n5. 移動元素:")
# 將搜索區域移動到標題之前
search_area = soup.find('div', class_='search-area')
main_title = soup.find('h1', class_='main-title')
if search_area and main_title:
# 提取搜索區域
search_area.extract()
# 插入到標題之前
main_title.insert_before(search_area)
print("將搜索區域移動到標題之前")
# 重新排序文章(按日期)
article_list = soup.find('main', class_='article-list')
if article_list:
articles = article_list.find_all('article', class_='article-item')
# 提取所有文章
article_data = []
for article in articles:
date_elem = article.find('span', class_='date')
date_str = date_elem.get_text() if date_elem else "2024-01-01"
article_data.append((date_str, article.extract()))
# 按日期排序(最新的在前)
article_data.sort(key=lambda x: x[0], reverse=True)
# 重新插入排序後的文章
for date_str, article in article_data:
article_list.append(article)
print(f"按日期重新排序了{len(article_data)}篇文章")
print("\n6. 批量操作:")
# 爲所有文章添加閱讀時間估算
articles = soup.find_all('article', class_='article-item')
for article in articles:
summary = article.find('p', class_='article-summary')
if summary:
# 估算閱讀時間(基於摘要長度)
text_length = len(summary.get_text())
read_time = max(1, text_length // 50) # 假設每50個字符需要1分鐘
read_time_span = soup.new_tag('span', attrs={'class': 'read-time'})
read_time_span.string = f"預計閱讀: {read_time}分鐘"
# 插入到摘要之後
summary.insert_after(read_time_span)
print(f"爲{len(articles)}篇文章添加了閱讀時間估算")
# 更新文章計數
footer = soup.find('footer', class_='page-footer')
if footer:
count_p = footer.find('p')
if count_p:
current_count = len(soup.find_all('article', class_='article-item'))
count_p.string = f"共 {current_count} 篇文章"
print(f"更新了文章計數: {current_count}")
print("\n7. 條件操作:")
# 只對特色文章添加特殊標記
featured_articles = soup.find_all('article', class_='featured')
for article in featured_articles:
title = article.find('h2', class_='article-title')
if title and not title.get_text().startswith('🔥'):
title.string = f"🔥 {title.get_text()}"
print(f"爲{len(featured_articles)}篇特色文章添加了火焰標記")
# 爲長摘要添加展開/收起功能
summaries = soup.find_all('p', class_='article-summary')
long_summaries = 0
for summary in summaries:
if len(summary.get_text()) > 30: # 超過30個字符認爲是長摘要
summary['class'] = summary.get('class', []) + ['long-summary']
summary['data-full-text'] = summary.get_text()
# 創建展開按鈕
expand_btn = soup.new_tag('button', attrs={'class': 'btn-expand'})
expand_btn.string = "展開"
summary.insert_after(expand_btn)
long_summaries += 1
print(f"爲{long_summaries}個長摘要添加了展開功能")
print("\n8. 最終文檔統計:")
# 統計最終結果
final_stats = {
'總文章數': len(soup.find_all('article', class_='article-item')),
'特色文章數': len(soup.find_all('article', class_='featured')),
'草稿文章數': len(soup.find_all('article', class_='draft')),
'總按鈕數': len(soup.find_all('button')),
'標籤數': len(soup.find_all('span', class_='tag')),
'總元素數': len(soup.find_all())
}
for key, value in final_stats.items():
print(f" {key}: {value}")
# 輸出部分修改後的HTML
print("\n9. 修改後的HTML片段:")
article_list = soup.find('main', class_='article-list')
if article_list:
first_article = article_list.find('article')
if first_article:
print(first_article.prettify()[:500] + "...")
return soup
# 運行元素操作演示
if __name__ == "__main__":
modified_soup = element_operations_demo()
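修改完成後通常還需要把文檔寫回磁盤。下面是一個簡單的示意寫法(文件名僅爲示例),用UTF-8編碼保存soup對象:
# 假設modified_soup是上面演示返回的BeautifulSoup對象
def save_soup(soup, path, pretty=True):
    """把(可能已修改的)soup對象以UTF-8寫回HTML文件"""
    html = soup.prettify() if pretty else str(soup)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# save_soup(modified_soup, 'articles_modified.html')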
編碼處理¶
BeautifulSoup能夠自動處理各種字符編碼問題。
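先看一個最小示例:同一段GBK字節,分別交給BeautifulSoup自動猜測編碼和用from_encoding顯式指定(示例內容爲虛構):
from bs4 import BeautifulSoup

# 模擬一段GBK編碼的字節內容
gbk_bytes = '<html><body><p>編碼測試</p></body></html>'.encode('gbk')

# 方式一: 讓BeautifulSoup自動猜測編碼(依賴內容特徵,結果不一定準確)
soup_auto = BeautifulSoup(gbk_bytes, 'html.parser')
print(soup_auto.original_encoding)

# 方式二: 已知編碼時用from_encoding顯式指定,更可靠
soup_gbk = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
print(soup_gbk.find('p').get_text())  # 編碼測試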
def encoding_demo():
"""
演示編碼處理功能
"""
print("=== 編碼處理功能演示 ===")
# 1. 自動編碼檢測
print("\n1. 自動編碼檢測:")
# 不同編碼的HTML內容
utf8_html = """
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>中文測試頁面</title>
</head>
<body>
<h1>歡迎來到Python學習網站</h1>
<p>這裏有豐富的Python教程和實例。</p>
<div class="content">
<h2>特殊字符測試</h2>
<p>數學符號: α β γ δ ε ∑ ∏ ∫</p>
<p>貨幣符號: ¥ $ € £ ₹</p>
<p>表情符號: 😀 😃 😄 😁 🚀 🎉</p>
<p>其他語言: こんにちは 안녕하세요 Здравствуйте</p>
</div>
</body>
</html>
"""
# 使用BeautifulSoup解析UTF-8內容
soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
print(f"UTF-8解析結果:")
print(f" 標題: {soup_utf8.find('title').get_text()}")
print(f" 主標題: {soup_utf8.find('h1').get_text()}")
# 獲取原始編碼信息
original_encoding = soup_utf8.original_encoding
print(f" 檢測到的原始編碼: {original_encoding}")
# 2. 處理不同編碼的內容
print("\n2. 處理不同編碼的內容:")
# 模擬GBK編碼的內容
gbk_content = "<html><body><h1>中文標題</h1><p>這是GBK編碼的內容</p></body></html>"
try:
# 將字符串編碼爲GBK字節
gbk_bytes = gbk_content.encode('gbk')
print(f"GBK字節長度: {len(gbk_bytes)}")
# 使用BeautifulSoup解析GBK字節
soup_gbk = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
print(f"GBK解析結果:")
print(f" 標題: {soup_gbk.find('h1').get_text()}")
print(f" 段落: {soup_gbk.find('p').get_text()}")
except UnicodeEncodeError as e:
print(f"GBK編碼錯誤: {e}")
# 3. 編碼轉換
print("\n3. 編碼轉換:")
# 獲取不同編碼格式的輸出
html_str = str(soup_utf8)
# UTF-8編碼
utf8_bytes = html_str.encode('utf-8')
print(f"UTF-8編碼字節數: {len(utf8_bytes)}")
# 嘗試其他編碼
encodings_to_test = ['utf-8', 'gbk', 'iso-8859-1', 'ascii']
for encoding in encodings_to_test:
try:
encoded_bytes = html_str.encode(encoding)
print(f"{encoding.upper()}編碼: 成功,{len(encoded_bytes)}字節")
except UnicodeEncodeError as e:
print(f"{encoding.upper()}編碼: 失敗 - {str(e)[:50]}...")
# 4. 處理編碼錯誤
print("\n4. 處理編碼錯誤:")
# 創建包含特殊字符的內容
special_html = """
<html>
<body>
<h1>特殊字符處理測試</h1>
<p>包含emoji: 🐍 Python編程</p>
<p>數學公式: E = mc²</p>
<p>版權符號: © 2024</p>
<p>商標符號: Python™</p>
</body>
</html>
"""
soup_special = BeautifulSoup(special_html, 'html.parser')
# 不同的錯誤處理策略
error_strategies = ['ignore', 'replace', 'xmlcharrefreplace']
for strategy in error_strategies:
try:
# 嘗試編碼爲ASCII(會出錯)
ascii_result = str(soup_special).encode('ascii', errors=strategy)
decoded_result = ascii_result.decode('ascii')
print(f"ASCII編碼策略'{strategy}': 成功")
print(f" 結果長度: {len(decoded_result)}字符")
# 顯示處理後的標題
soup_result = BeautifulSoup(decoded_result, 'html.parser')
title = soup_result.find('h1')
if title:
print(f" 處理後標題: {title.get_text()}")
except Exception as e:
print(f"ASCII編碼策略'{strategy}': 失敗 - {e}")
# 5. 自定義編碼處理
print("\n5. 自定義編碼處理:")
def safe_encode_html(soup_obj, target_encoding='utf-8', fallback_encoding='ascii'):
"""
安全地將BeautifulSoup對象編碼爲指定格式
"""
html_str = str(soup_obj)
try:
# 嘗試目標編碼
return html_str.encode(target_encoding)
except UnicodeEncodeError:
print(f" {target_encoding}編碼失敗,嘗試{fallback_encoding}")
try:
# 使用替換策略的後備編碼
return html_str.encode(fallback_encoding, errors='xmlcharrefreplace')
except UnicodeEncodeError:
print(f" {fallback_encoding}編碼也失敗,使用忽略策略")
return html_str.encode(fallback_encoding, errors='ignore')
# 測試自定義編碼函數
safe_bytes = safe_encode_html(soup_special, 'ascii')
print(f"安全編碼結果: {len(safe_bytes)}字節")
# 解碼並驗證
safe_html = safe_bytes.decode('ascii')
safe_soup = BeautifulSoup(safe_html, 'html.parser')
safe_title = safe_soup.find('h1')
if safe_title:
print(f"安全編碼後標題: {safe_title.get_text()}")
# 6. 編碼聲明處理
print("\n6. 編碼聲明處理:")
# 檢查和修改編碼聲明
meta_charset = soup_utf8.find('meta', attrs={'charset': True})
if meta_charset:
original_charset = meta_charset.get('charset')
print(f"原始字符集聲明: {original_charset}")
# 修改字符集聲明
meta_charset['charset'] = 'UTF-8'
print(f"修改後字符集聲明: {meta_charset.get('charset')}")
# 添加編碼聲明(如果不存在)
head = soup_utf8.find('head')
if head and not head.find('meta', attrs={'charset': True}):
charset_meta = soup_utf8.new_tag('meta', charset='UTF-8')
head.insert(0, charset_meta)
print("添加了字符集聲明")
# 7. 內容編碼驗證
print("\n7. 內容編碼驗證:")
def validate_encoding(html_content, expected_encoding='utf-8'):
"""
驗證HTML內容的編碼
"""
try:
if isinstance(html_content, str):
# 字符串內容,嘗試編碼
html_content.encode(expected_encoding)
return True, "字符串內容編碼有效"
elif isinstance(html_content, bytes):
# 字節內容,嘗試解碼
html_content.decode(expected_encoding)
return True, "字節內容編碼有效"
else:
return False, "未知內容類型"
except UnicodeError as e:
return False, f"編碼驗證失敗: {e}"
# 驗證不同內容的編碼
test_contents = [
(utf8_html, 'utf-8'),
(str(soup_utf8), 'utf-8'),
(str(soup_special), 'utf-8')
]
for content, encoding in test_contents:
is_valid, message = validate_encoding(content, encoding)
print(f" {encoding}編碼驗證: {'✓' if is_valid else '✗'} {message}")
# 8. 編碼統計信息
print("\n8. 編碼統計信息:")
def analyze_encoding(soup_obj):
"""
分析BeautifulSoup對象的編碼信息
"""
html_str = str(soup_obj)
stats = {
'總字符數': len(html_str),
'ASCII字符數': sum(1 for c in html_str if ord(c) < 128),
'非ASCII字符數': sum(1 for c in html_str if ord(c) >= 128),
'中文字符數': sum(1 for c in html_str if '\u4e00' <= c <= '\u9fff'),
'表情符號數': sum(1 for c in html_str if 0x1F300 <= ord(c) <= 0x1FAFF),  # 常見emoji所在Unicode區段的近似範圍
}
# 計算不同編碼的字節數
for encoding in ['utf-8', 'utf-16', 'utf-32']:
try:
byte_count = len(html_str.encode(encoding))
stats[f'{encoding.upper()}字節數'] = byte_count
except UnicodeEncodeError:
stats[f'{encoding.upper()}字節數'] = '編碼失敗'
return stats
# 分析特殊字符內容
encoding_stats = analyze_encoding(soup_special)
print("特殊字符內容編碼分析:")
for key, value in encoding_stats.items():
print(f" {key}: {value}")
# 9. 編碼最佳實踐建議
print("\n9. 編碼最佳實踐建議:")
recommendations = [
"✓ 始終使用UTF-8編碼處理HTML內容",
"✓ 在HTML頭部明確聲明字符集",
"✓ 處理用戶輸入時驗證編碼",
"✓ 使用適當的錯誤處理策略",
"✓ 測試特殊字符和多語言內容",
"✓ 避免混合使用不同編碼"
]
for rec in recommendations:
print(f" {rec}")
return soup_utf8, soup_special
# 運行編碼處理演示
if __name__ == "__main__":
utf8_soup, special_soup = encoding_demo()
終端日誌:
=== 編碼處理功能演示 ===
1. 自動編碼檢測:
UTF-8解析結果:
標題: 中文測試頁面
主標題: 歡迎來到Python學習網站
檢測到的原始編碼: None
2. 處理不同編碼的內容:
GBK字節長度: 59
GBK解析結果:
標題: 中文標題
段落: 這是GBK編碼的內容
3. 編碼轉換:
UTF-8編碼字節數: 674
UTF-8編碼: 成功,674字節
GBK編碼: 成功,638字節
ISO-8859-1編碼: 失敗 - 'latin-1' codec can't encode character '\u4e2d'...
ASCII編碼: 失敗 - 'ascii' codec can't encode character '\u4e2d' in...
4. 處理編碼錯誤:
ASCII編碼策略'ignore': 成功
結果長度: 158字符
處理後標題:
ASCII編碼策略'replace': 成功
結果長度: 398字符
處理後標題: ????????????
ASCII編碼策略'xmlcharrefreplace': 成功
結果長度: 1058字符
處理後標題: 特殊字符處理測試
5. 自定義編碼處理:
ascii編碼失敗,嘗試ascii
安全編碼結果: 1058字節
安全編碼後標題: 特殊字符處理測試
6. 編碼聲明處理:
原始字符集聲明: UTF-8
修改後字符集聲明: UTF-8
7. 內容編碼驗證:
utf-8編碼驗證: ✓ 字符串內容編碼有效
utf-8編碼驗證: ✓ 字符串內容編碼有效
utf-8編碼驗證: ✓ 字符串內容編碼有效
8. 編碼統計信息:
特殊字符內容編碼分析:
總字符數: 254
ASCII字符數: 158
非ASCII字符數: 96
中文字符數: 12
表情符號數: 1
UTF-8字節數: 302
UTF-16字節數: 510
UTF-32字節數: 1018
9. 編碼最佳實踐建議:
✓ 始終使用UTF-8編碼處理HTML內容
✓ 在HTML頭部明確聲明字符集
✓ 處理用戶輸入時驗證編碼
✓ 使用適當的錯誤處理策略
✓ 測試特殊字符和多語言內容
✓ 避免混合使用不同編碼
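把上述建議落實到實際爬取中,大致可以按下面的思路處理編碼(示意寫法,具體站點的情況可能有所不同):
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """請求頁面並儘量以正確的編碼解析(示意寫法,URL僅爲佔位)"""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # 優先把原始字節交給BeautifulSoup,讓它結合<meta charset>等信息自行判斷
    soup = BeautifulSoup(resp.content, 'html.parser')
    # 如有需要,也可以參考requests根據內容猜測的編碼
    print('HTTP頭聲明的編碼:', resp.encoding)
    print('根據內容猜測的編碼:', resp.apparent_encoding)
    print('BeautifulSoup檢測到的編碼:', soup.original_encoding)
    return soup

# soup = fetch_soup('https://yeyupiaoling.cn/')  # 示例站點,來自本章前文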