Chapter 14: Web Crawlers and Automation

Web crawling is a key technique for modern data acquisition and automated processing: a crawler imitates browser behavior to visit web pages automatically and extract the information you need. This chapter starts with the basic concepts and works step by step toward advanced crawling frameworks and automation techniques, helping you build a complete set of crawler-development skills.

14.1 Web Crawler Fundamentals

Crawler Overview

Definition and Uses of Web Crawlers

A web crawler, also called a web spider or web robot, is a program that automatically browses the World Wide Web according to a set of rules and collects information. Typical uses include:

  1. Data collection: gathering product listings, news articles, stock prices, and so on
  2. Search engines: building the index databases behind search engines
  3. Market analysis: collecting competitor information for market research
  4. Content monitoring: watching sites for changes and picking up updates promptly
  5. Academic research: collecting research data for analysis

How a Crawler Works

The basic workflow of a web crawler is:

  1. Send an HTTP request to the target site
  2. Receive the response: the HTML page returned by the server
  3. Parse the page content and extract the data you need
  4. Store the data in a file or database
  5. Discover new links: find new URLs in the current page
  6. Repeat the process for each newly discovered URL

Let's walk through a small example to see these steps in action:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def simple_crawler(url):
    """
    A minimal web crawler example
    """
    try:
        # 1. Send the HTTP request
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)

        # 2. Check the response status
        if response.status_code == 200:
            # 3. Parse the page content
            soup = BeautifulSoup(response.text, 'html.parser')

            # 4. Extract the title
            title = soup.find('title')
            if title:
                print(f"Page title: {title.get_text().strip()}")

            # 5. Extract all links
            links = soup.find_all('a', href=True)
            print(f"Found {len(links)} links:")

            for i, link in enumerate(links[:5]):  # show only the first 5 links
                # urljoin resolves relative hrefs against the base URL
                # (naively concatenating the base URL would corrupt absolute links)
                href = urljoin(url, link['href'])
                text = link.get_text().strip()
                print(f"{i+1}. {text} -> {href}")

        else:
            print(f"Request failed with status code: {response.status_code}")

    except Exception as e:
        print(f"Error while crawling: {e}")

# Usage example
if __name__ == "__main__":
    url = "https://yeyupiaoling.cn"
    simple_crawler(url)

Running the code above produces output similar to:

Page title: 夜雨飄零的博客 - 首頁
Found 50 links:
1.  -> https://yeyupiaoling.cn/
2. 夜雨飄零 -> https://yeyupiaoling.cn/
3. 首頁 -> https://yeyupiaoling.cn/
4. 歸檔 -> https://yeyupiaoling.cn/archive
5. 標籤 -> https://yeyupiaoling.cn/tag

Classification and Characteristics of Crawlers

Crawlers can be grouped along several dimensions:

By crawl scope:
- General-purpose crawlers: the crawlers behind search engines, which traverse the whole web
- Focused crawlers: crawlers targeting a specific topic or site
- Incremental crawlers: crawlers that fetch only new or updated content (see the sketch after this list)

By implementation technique:
- Static crawlers: can only handle static HTML pages
- Dynamic crawlers: can handle pages rendered by JavaScript

By crawl depth:
- Shallow crawlers: fetch only the home page or a few levels of pages
- Deep crawlers: descend through many levels of a site's structure
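To make the incremental idea concrete, here is a minimal sketch that re-fetches a list of URLs and processes only pages whose content has changed since the last run. The seen_pages.json state file and SHA-256 fingerprinting are illustrative choices for this sketch, not a standard:

import hashlib
import json
import os

import requests

SEEN_FILE = 'seen_pages.json'  # hypothetical state file for this sketch

def page_fingerprint(html):
    """Hash the page body so content changes can be detected cheaply."""
    return hashlib.sha256(html.encode('utf-8')).hexdigest()

def incremental_fetch(urls):
    # Load fingerprints from the previous run, if any
    seen = {}
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE, 'r', encoding='utf-8') as f:
            seen = json.load(f)

    for url in urls:
        response = requests.get(url, timeout=10)
        fingerprint = page_fingerprint(response.text)

        if seen.get(url) == fingerprint:
            print(f"Unchanged, skipping: {url}")
            continue

        print(f"New or updated, processing: {url}")
        # ... parse and store the page here ...
        seen[url] = fingerprint

    # Persist the fingerprints for the next run
    with open(SEEN_FILE, 'w', encoding='utf-8') as f:
        json.dump(seen, f)

if __name__ == "__main__":
    incremental_fetch(["https://httpbin.org/html"])

A production incremental crawler would usually also honor HTTP caching headers (ETag, Last-Modified) instead of hashing alone.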

Legal and Ethical Considerations

When developing web crawlers you must follow the relevant laws and ethical guidelines:

  1. Honor the robots.txt protocol: check the site's robots.txt file (see the example after this list)
  2. Throttle your crawl rate: avoid putting excessive load on the server
  3. Respect copyright: do not scrape copyright-protected content
  4. Protect privacy: do not scrape personal or private information
  5. Use data responsibly: use scraped data only for lawful purposes
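For the first point, Python's standard library ships urllib.robotparser, which makes the robots.txt check straightforward. A minimal example (at the time of writing, httpbin.org's robots.txt disallows /deny for all user agents, which makes it a convenient test target):

from urllib.robotparser import RobotFileParser

def can_fetch(base_url, path, user_agent='MySpider/1.0'):
    """Check robots.txt before crawling a path."""
    parser = RobotFileParser()
    parser.set_url(base_url.rstrip('/') + '/robots.txt')
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, base_url.rstrip('/') + path)

if __name__ == "__main__":
    print(can_fetch('https://httpbin.org', '/deny'))  # expected: False
    print(can_fetch('https://httpbin.org', '/get'))   # expected: True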

HTTP Protocol Basics

HTTP Requests and Responses

HTTP (HyperText Transfer Protocol) is the protocol a web crawler uses to communicate with web servers. Understanding HTTP is essential for building efficient crawlers.

HTTP communication has two main parts:
- Request: the message the client sends to the server
- Response: the message the server returns to the client

Let's inspect the details of an HTTP request and response in code:

import requests

def analyze_http_communication(url):
    """
    Inspect the details of an HTTP request and its response
    """
    # Create a session object
    session = requests.Session()

    # Set the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    }

    try:
        # Send the request
        response = session.get(url, headers=headers)

        print("=== HTTP request ===")
        print(f"Request URL: {response.request.url}")
        print(f"Request method: {response.request.method}")
        print("Request headers:")
        for key, value in response.request.headers.items():
            print(f"  {key}: {value}")

        print("\n=== HTTP response ===")
        print(f"Status code: {response.status_code}")
        print(f"Reason: {response.reason}")
        print(f"Elapsed time: {response.elapsed.total_seconds():.3f}s")
        print("Response headers:")
        for key, value in response.headers.items():
            print(f"  {key}: {value}")

        print(f"\nResponse body length: {len(response.text)} characters")
        print(f"Content type: {response.headers.get('Content-Type', 'Unknown')}")

    except requests.RequestException as e:
        print(f"Request failed: {e}")

# Usage example
if __name__ == "__main__":
    analyze_http_communication("https://yeyupiaoling.cn/")

Sample output:

=== HTTP request ===
Request URL: https://yeyupiaoling.cn/
Request method: GET
Request headers:
  User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
  Accept-Encoding: gzip, deflate
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
  Connection: keep-alive
  Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3

=== HTTP response ===
Status code: 200
Reason: OK
Elapsed time: 0.197s
Response headers:
  Server: nginx/1.18.0 (Ubuntu)
  Date: Sat, 16 Aug 2025 04:36:49 GMT
  Content-Type: text/html; charset=utf-8
  Transfer-Encoding: chunked
  Connection: keep-alive
  Vary: Cookie
  Content-Encoding: gzip

Response body length: 29107 characters
Content type: text/html; charset=utf-8

Cookies and Sessions

Cookies and sessions are the main mechanisms web applications use to maintain user state:

  • Cookie: a small piece of data stored on the client
  • Session: user session state stored on the server

In crawler development, handling cookies and sessions correctly is essential for simulating logins and keeping a session alive:

import requests

def demonstrate_cookies_and_sessions():
    """
    Demonstrate working with cookies and sessions
    """
    # Create a session object
    session = requests.Session()

    print("=== Cookie operations demo ===")

    # 1. Set cookies
    cookie_url = "https://httpbin.org/cookies/set"
    cookie_params = {
        'username': 'testuser',
        'session_id': 'abc123',
        'preferences': 'dark_theme'
    }

    # Setting the cookies (this endpoint responds with a redirect)
    response = session.get(cookie_url, params=cookie_params)
    print(f"Status code after setting cookies: {response.status_code}")

    # 2. Inspect the current cookies
    print("\nCookies in the current session:")
    for cookie in session.cookies:
        print(f"  {cookie.name} = {cookie.value}")

    # 3. Send a request carrying the cookies
    cookie_test_url = "https://httpbin.org/cookies"
    response = session.get(cookie_test_url)
    if response.status_code == 200:
        cookies_data = response.json()
        print(f"\nCookies received by the server: {cookies_data.get('cookies', {})}")

    # 4. Set cookies manually
    print("\n=== Manual cookie operations ===")
    manual_session = requests.Session()

    # Option 1: update with a dict
    manual_session.cookies.update({
        'user_id': '12345',
        'auth_token': 'xyz789'
    })

    # Option 2: the set() method
    manual_session.cookies.set('language', 'zh-CN', domain='httpbin.org')

    # Test the manually set cookies
    response = manual_session.get("https://httpbin.org/cookies")
    if response.status_code == 200:
        cookies_data = response.json()
        print(f"Manually set cookies: {cookies_data.get('cookies', {})}")

    # 5. Cookie persistence
    print("\n=== Cookie persistence ===")

    # Save the cookies to a file
    import pickle

    with open('cookies.pkl', 'wb') as f:
        pickle.dump(session.cookies, f)
    print("Cookies saved to file")

    # Load the cookies
    new_session = requests.Session()
    try:
        with open('cookies.pkl', 'rb') as f:
            new_session.cookies = pickle.load(f)
        print("Cookies loaded from file")

        # Test the loaded cookies
        response = new_session.get("https://httpbin.org/cookies")
        if response.status_code == 200:
            cookies_data = response.json()
            print(f"Loaded cookies: {cookies_data.get('cookies', {})}")
    except FileNotFoundError:
        print("Cookie file not found")

# Simulated login example
def simulate_login_with_session():
    """
    Simulate a website login flow
    """
    print("\n=== Simulated login flow ===")

    session = requests.Session()

    # 1. Visit the login page (to pick up required cookies and tokens)
    login_page_url = "https://httpbin.org/cookies/set/csrf_token/abc123def456"
    response = session.get(login_page_url)
    print(f"Login page visit: {response.status_code}")

    # 2. Submit the login form
    login_data = {
        'username': 'testuser',
        'password': 'testpass',
        'csrf_token': 'abc123def456'
    }

    login_url = "https://httpbin.org/post"
    response = session.post(login_url, data=login_data)

    if response.status_code == 200:
        print("Login request sent successfully")
        response_data = response.json()
        print(f"Submitted login data: {response_data.get('form', {})}")

    # 3. Visit a page that requires login
    protected_url = "https://httpbin.org/cookies"
    response = session.get(protected_url)

    if response.status_code == 200:
        print("Protected page accessed successfully")
        cookies_data = response.json()
        print(f"Current session cookies: {cookies_data.get('cookies', {})}")

# Run the demos
if __name__ == "__main__":
    demonstrate_cookies_and_sessions()
    simulate_login_with_session()

Output:

=== Cookie operations demo ===
Status code after setting cookies: 200

Cookies in the current session:
  username = testuser
  session_id = abc123
  preferences = dark_theme

Cookies received by the server: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}

=== Manual cookie operations ===
Manually set cookies: {'user_id': '12345', 'auth_token': 'xyz789', 'language': 'zh-CN'}

=== Cookie persistence ===
Cookies saved to file
Cookies loaded from file
Loaded cookies: {'username': 'testuser', 'session_id': 'abc123', 'preferences': 'dark_theme'}

=== Simulated login flow ===
Login page visit: 200
Login request sent successfully
Submitted login data: {'username': 'testuser', 'password': 'testpass', 'csrf_token': 'abc123def456'}
Protected page accessed successfully
Current session cookies: {'csrf_token': 'abc123def456'}

Analyzing Web Page Structure

HTML Structure Basics

Understanding HTML structure is the foundation of extracting data from web pages. HTML (HyperText Markup Language) uses tags to define the structure and semantics of page content.

A typical HTML page is structured like this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Page Title</title>
    <link rel="stylesheet" href="style.css">
</head>
<body>
    <header>
        <nav>
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
            </ul>
        </nav>
    </header>

    <main>
        <article>
            <h1>Article Title</h1>
            <p class="content">Article content...</p>
        </article>
    </main>

    <footer>
        <p>&copy; 2024 Copyright notice</p>
    </footer>

    <script src="script.js"></script>
</body>
</html>

Let's write a small HTML structure analysis tool:

import requests
from bs4 import BeautifulSoup, Doctype
from collections import Counter

def analyze_html_structure(url):
    """
    Analyze the HTML structure of a web page
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            print(f"=== HTML structure analysis: {url} ===")

            # 1. Basic information
            title = soup.find('title')
            print(f"Page title: {title.get_text().strip() if title else 'no title'}")

            # 2. Document type and encoding
            # Look for an explicit Doctype node among the top-level contents
            doctype = next((str(item) for item in soup.contents if isinstance(item, Doctype)), None)
            print(f"Doctype: {doctype if doctype else 'not declared'}")

            charset_meta = soup.find('meta', attrs={'charset': True})
            if not charset_meta:
                charset_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
            encoding = charset_meta.get('charset') if charset_meta else response.encoding
            print(f"Character encoding: {encoding}")

            # 3. Tag statistics
            all_tags = [tag.name for tag in soup.find_all()]
            tag_counter = Counter(all_tags)
            print(f"\nTag statistics (top 10):")
            for tag, count in tag_counter.most_common(10):
                print(f"  {tag}: {count}")

            # 4. Link analysis
            links = soup.find_all('a', href=True)
            print(f"\nLink analysis:")
            print(f"  Total links: {len(links)}")

            internal_links = []
            external_links = []

            for link in links:
                href = link['href']
                if href.startswith('http'):
                    # naive containment check; comparing parsed hostnames is more robust
                    if url in href:
                        internal_links.append(href)
                    else:
                        external_links.append(href)
                elif href.startswith('/'):
                    internal_links.append(href)

            print(f"  Internal links: {len(internal_links)}")
            print(f"  External links: {len(external_links)}")

            # 5. Image analysis
            images = soup.find_all('img')
            print(f"\nImage analysis:")
            print(f"  Total images: {len(images)}")

            img_with_alt = [img for img in images if img.get('alt')]
            print(f"  With alt attribute: {len(img_with_alt)}")

            # 6. Form analysis
            forms = soup.find_all('form')
            print(f"\nForm analysis:")
            print(f"  Total forms: {len(forms)}")

            for i, form in enumerate(forms):
                method = form.get('method', 'GET').upper()
                action = form.get('action', 'current page')
                inputs = form.find_all(['input', 'select', 'textarea'])
                print(f"  Form {i+1}: {method} -> {action} ({len(inputs)} fields)")

            # 7. Scripts and styles
            scripts = soup.find_all('script')
            stylesheets = soup.find_all('link', rel='stylesheet')

            print(f"\nResource analysis:")
            print(f"  <script> tags (external and inline): {len(scripts)}")
            print(f"  CSS stylesheets: {len(stylesheets)}")

            # 8. Structural outline
            print(f"\nPage structure:")
            body = soup.find('body')
            if body:
                print_structure(body, level=0, max_level=3)

        else:
            print(f"Request failed with status code: {response.status_code}")

    except Exception as e:
        print(f"Error during analysis: {e}")

def print_structure(element, level=0, max_level=3):
    """
    Recursively print the HTML structure
    """
    if level > max_level:
        return

    indent = "  " * level
    tag_name = element.name

    # Pick out the important attributes
    attrs = []
    if element.get('id'):
        attrs.append(f"id='{element['id']}'")
    if element.get('class'):
        classes = ' '.join(element['class'])
        attrs.append(f"class='{classes}'")

    attr_str = f" [{', '.join(attrs)}]" if attrs else ""
    print(f"{indent}<{tag_name}>{attr_str}")

    # Recurse into child elements
    for child in element.children:
        if hasattr(child, 'name') and child.name:
            print_structure(child, level + 1, max_level)

# Usage example
if __name__ == "__main__":
    # Analyze a sample page
    analyze_html_structure("https://httpbin.org/html")

Sample output:

=== HTML structure analysis: https://httpbin.org/html ===
Page title: Herman Melville - Moby-Dick
Doctype: html
Character encoding: utf-8

Tag statistics (top 10):
  p: 4
  a: 3
  h1: 1
  body: 1
  html: 1
  head: 1
  title: 1

Link analysis:
  Total links: 3
  Internal links: 0
  External links: 3

Image analysis:
  Total images: 0
  With alt attribute: 0

Form analysis:
  Total forms: 0

Resource analysis:
  <script> tags (external and inline): 0
  CSS stylesheets: 0

Page structure:
<body>
  <h1>
  <p>
  <p>
  <p>
  <p>

CSS Selectors

CSS selectors are a powerful tool for locating HTML elements and play a key role in web data extraction. Understanding selector syntax is essential for pinpointing target elements precisely.

Basic selectors:
- Tag selectors: div, p, a
- Class selector: .class-name
- ID selector: #element-id
- Attribute selector: [attribute="value"]

Combinators:
- Descendant selector: div p (all p elements inside a div)
- Child selector: div > p (p elements that are direct children of a div)
- Adjacent sibling selector: h1 + p (a p element immediately following an h1)
- General sibling selector: h1 ~ p (all sibling p elements after an h1)

Pseudo-class selectors:
- :first-child, :last-child, :nth-child(n)
- :not(selector); note that :contains(text) is not part of the CSS standard — BeautifulSoup's selector engine (Soup Sieve) provides the equivalent non-standard extension :-soup-contains(text)

Let's learn CSS selectors through examples:

import requests
from bs4 import BeautifulSoup

def demonstrate_css_selectors():
    """
    Demonstrate CSS selector usage
    """
    # Build a sample HTML document
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>CSS selector examples</title>
    </head>
    <body>
        <div class="container">
            <h1 id="main-title">News List</h1>
            <div class="news-section">
                <article class="news-item featured">
                    <h2>Featured news headline 1</h2>
                    <p class="summary">This is a news summary...</p>
                    <span class="date">2024-01-15</span>
                    <a href="/news/1" class="read-more">Read more</a>
                </article>
                <article class="news-item">
                    <h2>Regular news headline 2</h2>
                    <p class="summary">Another news summary...</p>
                    <span class="date">2024-01-14</span>
                    <a href="/news/2" class="read-more">Read more</a>
                </article>
                <article class="news-item">
                    <h2>Regular news headline 3</h2>
                    <p class="summary">A third news summary...</p>
                    <span class="date">2024-01-13</span>
                    <a href="/news/3" class="read-more">Read more</a>
                </article>
            </div>
            <aside class="sidebar">
                <h3>Popular Tags</h3>
                <ul class="tag-list">
                    <li><a href="/tag/tech" data-category="technology">Technology</a></li>
                    <li><a href="/tag/sports" data-category="sports">Sports</a></li>
                    <li><a href="/tag/finance" data-category="finance">Finance</a></li>
                </ul>
            </aside>
        </div>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_content, 'html.parser')

    print("=== CSS selector demo ===")

    # 1. Basic selectors
    print("\n1. Basic selectors:")

    # Tag selector
    h2_elements = soup.select('h2')
    print(f"All h2 tags ({len(h2_elements)}):")
    for h2 in h2_elements:
        print(f"  - {h2.get_text().strip()}")

    # Class selector
    news_items = soup.select('.news-item')
    print(f"\nAll news items ({len(news_items)}):")
    for i, item in enumerate(news_items, 1):
        title = item.select_one('h2').get_text().strip()
        print(f"  {i}. {title}")

    # ID selector
    main_title = soup.select_one('#main-title')
    print(f"\nMain title: {main_title.get_text().strip()}")

    # Attribute selector
    tech_links = soup.select('a[data-category="technology"]')
    print(f"\nTechnology links ({len(tech_links)}):")
    for link in tech_links:
        print(f"  - {link.get_text().strip()} -> {link.get('href')}")

    # 2. Combinators
    print("\n2. Combinators:")

    # Descendant selector
    container_links = soup.select('.container a')
    print(f"All links inside the container ({len(container_links)}):")
    for link in container_links:
        text = link.get_text().strip()
        href = link.get('href', '#')
        print(f"  - {text} -> {href}")

    # Child selector
    direct_children = soup.select('.news-section > .news-item')
    print(f"\nDirect children of the news section ({len(direct_children)})")

    # Adjacent sibling selector
    after_h2 = soup.select('h2 + p')
    print(f"\np elements immediately after an h2 ({len(after_h2)}):")
    for p in after_h2:
        print(f"  - {p.get_text().strip()[:30]}...")

    # 3. Pseudo-class selectors
    print("\n3. Pseudo-class selectors:")

    # First and last child elements
    first_news = soup.select('.news-item:first-child')
    last_news = soup.select('.news-item:last-child')

    if first_news:
        first_title = first_news[0].select_one('h2').get_text().strip()
        print(f"First news item: {first_title}")

    if last_news:
        last_title = last_news[0].select_one('h2').get_text().strip()
        print(f"Last news item: {last_title}")

    # nth-child selector
    second_news = soup.select('.news-item:nth-child(2)')
    if second_news:
        second_title = second_news[0].select_one('h2').get_text().strip()
        print(f"Second news item: {second_title}")

    # 4. More complex combinations
    print("\n4. Complex selectors:")

    # Select the featured item's headline
    featured_title = soup.select('.news-item.featured h2')
    if featured_title:
        print(f"Featured headline: {featured_title[0].get_text().strip()}")

    # Select elements by tag plus class
    read_more_links = soup.select('a.read-more')
    print(f"'Read more' links ({len(read_more_links)})")

    # Select elements that carry a particular attribute
    category_links = soup.select('a[data-category]')
    print(f"Links with a category attribute ({len(category_links)}):")
    for link in category_links:
        category = link.get('data-category')
        text = link.get_text().strip()
        print(f"  - {text} (category: {category})")

# Applying CSS selectors to a real page
def extract_data_with_css_selectors(url):
    """
    Extract data from a real web page with CSS selectors
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            print(f"\n=== Extracting data from {url} ===")

            # Extract the page title
            title = soup.select_one('title')
            if title:
                print(f"Page title: {title.get_text().strip()}")

            # Extract all links
            links = soup.select('a[href]')
            print(f"\nFound {len(links)} links:")

            for i, link in enumerate(links[:5], 1):  # show only the first 5
                text = link.get_text().strip()
                href = link.get('href')
                print(f"  {i}. {text[:50]}... -> {href}")

            # Extract all paragraph text
            paragraphs = soup.select('p')
            if paragraphs:
                print(f"\nParagraphs ({len(paragraphs)} total):")
                for i, p in enumerate(paragraphs[:3], 1):  # show only the first 3
                    text = p.get_text().strip()
                    if text:
                        print(f"  {i}. {text[:100]}...")
        else:
            print(f"Request failed with status code: {response.status_code}")

    except Exception as e:
        print(f"Error extracting data: {e}")

# Run the demos
if __name__ == "__main__":
    demonstrate_css_selectors()
    extract_data_with_css_selectors("https://httpbin.org/html")

JavaScript and Dynamic Content

Modern websites rely heavily on JavaScript to generate content dynamically, which poses a challenge for traditional static crawlers. Dynamic content includes:

  1. AJAX-loaded data: content fetched via asynchronous requests
  2. JavaScript-rendered pages: page structure generated entirely by JS
  3. Interaction-triggered content: content shown only after clicks, scrolling, and so on
  4. Live-updating data: content pushed over WebSocket or refreshed on a timer

Approaches to handling dynamic content:

Method 1: Analyze the AJAX requests

import requests
import json

def analyze_ajax_requests():
    """
    Analyze and replay AJAX requests
    """
    print("=== AJAX request analysis ===")

    # Replay an AJAX-style request
    ajax_url = "https://httpbin.org/json"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'X-Requested-With': 'XMLHttpRequest',  # marks the request as AJAX
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/json'
    }

    try:
        response = requests.get(ajax_url, headers=headers)

        if response.status_code == 200:
            data = response.json()
            print(f"AJAX response data:")
            print(json.dumps(data, indent=2, ensure_ascii=False))
        else:
            print(f"AJAX request failed: {response.status_code}")

    except Exception as e:
        print(f"AJAX request exception: {e}")

# Run the AJAX analysis
if __name__ == "__main__":
    analyze_ajax_requests()

Method 2: Use Selenium to handle JavaScript

# Note: requires selenium and a matching browser driver
# pip install selenium
# (recent Selenium releases can also fetch the driver automatically via Selenium Manager)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

def handle_dynamic_content_with_selenium():
    """
    Handle dynamic content with Selenium
    """
    print("=== Handling dynamic content with Selenium ===")

    # Configure Chrome options
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # headless mode
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    try:
        # Create the WebDriver instance
        driver = webdriver.Chrome(options=chrome_options)

        # Visit a page (a static one here, to keep the demo simple)
        driver.get("https://httpbin.org/html")

        # Wait for the page to load
        wait = WebDriverWait(driver, 10)

        # Read the page title
        title = driver.title
        print(f"Page title: {title}")

        # Locate an element
        h1_element = wait.until(
            EC.presence_of_element_located((By.TAG_NAME, "h1"))
        )
        print(f"H1 content: {h1_element.text}")

        # Collect all links
        links = driver.find_elements(By.TAG_NAME, "a")
        print(f"\nFound {len(links)} links:")

        for i, link in enumerate(links, 1):
            text = link.text.strip()
            href = link.get_attribute('href')
            print(f"  {i}. {text} -> {href}")

        # Execute JavaScript
        js_result = driver.execute_script("return document.title;")
        print(f"\nJavaScript result: {js_result}")

    except Exception as e:
        print(f"Selenium error: {e}")
    finally:
        if 'driver' in locals():
            driver.quit()

# Note: actually running this requires ChromeDriver (or Selenium Manager)
# the code above just demonstrates the structure

Page Encodings and Character Sets

Handling page encodings correctly is the key to avoiding garbled text. Common encodings include:

  • UTF-8: a Unicode encoding that covers characters from all languages
  • GBK/GB2312: Chinese character encodings
  • ISO-8859-1: a Western European character encoding
  • ASCII: the basic English character encoding

The example below shows several ways to detect and correct a page's encoding:

import requests
from bs4 import BeautifulSoup
import chardet  # third-party encoding detector: pip install chardet

def handle_encoding_issues():
    """
    Deal with page encoding problems
    """
    print("=== Page encoding handling ===")

    # Try pages with different encodings
    test_urls = [
        "https://httpbin.org/encoding/utf8",
        "https://httpbin.org/html",
    ]

    for url in test_urls:
        try:
            print(f"\nProcessing URL: {url}")

            # Fetch the raw response
            response = requests.get(url)

            print(f"Response encoding: {response.encoding}")
            print(f"Apparent encoding: {response.apparent_encoding}")

            # Method 1: detect the encoding with chardet
            detected_encoding = chardet.detect(response.content)
            print(f"Detected encoding: {detected_encoding}")

            # Method 2: read the encoding from the HTML meta tags
            soup = BeautifulSoup(response.content, 'html.parser')

            # Look for a charset declaration
            charset_meta = soup.find('meta', attrs={'charset': True})
            if charset_meta:
                declared_charset = charset_meta.get('charset')
                print(f"Declared encoding: {declared_charset}")
            else:
                # Look for an http-equiv style meta tag
                content_type_meta = soup.find('meta', attrs={'http-equiv': 'Content-Type'})
                if content_type_meta:
                    content = content_type_meta.get('content', '')
                    if 'charset=' in content:
                        declared_charset = content.split('charset=')[1].split(';')[0]
                        print(f"Declared encoding: {declared_charset}")

            # Method 3: set the correct encoding and re-parse
            if detected_encoding['encoding']:
                response.encoding = detected_encoding['encoding']
                soup = BeautifulSoup(response.text, 'html.parser')

                title = soup.find('title')
                if title:
                    print(f"Title with the correct encoding: {title.get_text().strip()}")

        except Exception as e:
            print(f"Encoding handling error: {e}")

def create_encoding_safe_crawler():
    """
    Build an encoding-safe crawler
    """
    def safe_get_text(url, timeout=10):
        """
        Fetch page content safely with respect to encoding
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }

            response = requests.get(url, headers=headers, timeout=timeout)

            # 1. First try the encoding from the response headers
            if response.encoding != 'ISO-8859-1':  # avoid the misleading default encoding
                soup = BeautifulSoup(response.text, 'html.parser')
            else:
                # 2. Detect the encoding with chardet
                detected = chardet.detect(response.content)
                if detected['confidence'] > 0.7:  # confidence threshold
                    response.encoding = detected['encoding']
                    soup = BeautifulSoup(response.text, 'html.parser')
                else:
                    # 3. Try common encodings in turn
                    for encoding in ['utf-8', 'gbk', 'gb2312']:
                        try:
                            text = response.content.decode(encoding)
                            soup = BeautifulSoup(text, 'html.parser')
                            break
                        except UnicodeDecodeError:
                            continue
                    else:
                        # 4. Fall back and let BeautifulSoup cope
                        soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')

            return soup

        except Exception as e:
            print(f"Failed to fetch page content: {e}")
            return None

    # Test the encoding-safe crawler
    test_url = "https://httpbin.org/html"
    soup = safe_get_text(test_url)

    if soup:
        title = soup.find('title')
        print(f"\nEncoding-safe crawler results:")
        print(f"Title: {title.get_text().strip() if title else 'no title'}")

        # Extract the text content
        paragraphs = soup.find_all('p')
        print(f"Paragraph count: {len(paragraphs)}")

        for i, p in enumerate(paragraphs[:2], 1):
            text = p.get_text().strip()
            print(f"Paragraph {i}: {text[:100]}...")

# Run the encoding demos
if __name__ == "__main__":
    handle_encoding_issues()
    create_encoding_safe_crawler()

Crawler Development Environment

Choosing Development Tools

The right tooling significantly improves crawler development productivity:

IDEs and editors:
- PyCharm: a full-featured Python IDE with debugging and code analysis
- VS Code: a lightweight editor with a rich plugin ecosystem
- Jupyter Notebook: well suited to data analysis and prototyping
- Sublime Text: a fast text editor

Browser developer tools:
- Chrome DevTools: inspect page structure, network requests, and JavaScript execution
- Firefox Developer Tools: similar to Chrome's, and stronger in some areas
- Network panel: view HTTP requests and responses
- Elements panel: inspect HTML structure and CSS styles

Packet-capture tools (see the proxy sketch after this list):
- Fiddler: an HTTP debugging proxy for Windows
- Charles: a cross-platform HTTP monitoring tool
- mitmproxy: a Python-based man-in-the-middle proxy
- Wireshark: a network protocol analyzer
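To capture your own crawler's traffic in one of these tools, point Requests at the capture proxy. A minimal sketch, assuming a debugging proxy such as mitmproxy or Charles is listening on 127.0.0.1:8080 (their usual default port, but an assumption here; adjust to your setup):

import requests
import urllib3

# Route both HTTP and HTTPS traffic through the local debugging proxy
DEBUG_PROXIES = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
}

# The proxy re-signs HTTPS traffic with its own certificate, so either
# trust that certificate or disable verification for local debugging only.
urllib3.disable_warnings()

response = requests.get(
    'https://httpbin.org/get',
    proxies=DEBUG_PROXIES,
    verify=False,
    timeout=10,
)
print(response.status_code)  # the full exchange is now visible in the proxy UI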

Proxies and IP Pools

Routing traffic through proxy servers hides your real IP address and helps you avoid being blocked by target sites:

import requests
from itertools import cycle

class ProxyManager:
    """
    Proxy manager
    """
    def __init__(self):
        # Proxy list (placeholders; substitute working proxies in practice)
        self.proxy_list = [
            {'http': 'http://proxy1:port', 'https': 'https://proxy1:port'},
            {'http': 'http://proxy2:port', 'https': 'https://proxy2:port'},
            {'http': 'http://proxy3:port', 'https': 'https://proxy3:port'},
        ]
        self.proxy_cycle = cycle(self.proxy_list)
        self.failed_proxies = set()

    def get_proxy(self):
        """
        Get an available proxy
        """
        for _ in range(len(self.proxy_list)):
            proxy = next(self.proxy_cycle)
            proxy_key = str(proxy)

            if proxy_key not in self.failed_proxies:
                return proxy

        # If every proxy has failed, clear the failure list and start over
        self.failed_proxies.clear()
        return next(self.proxy_cycle)

    def mark_proxy_failed(self, proxy):
        """
        Mark a proxy as failed
        """
        self.failed_proxies.add(str(proxy))

    def test_proxy(self, proxy, test_url="https://httpbin.org/ip"):
        """
        Check whether a proxy works
        """
        try:
            response = requests.get(
                test_url, 
                proxies=proxy, 
                timeout=10,
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
            )

            if response.status_code == 200:
                data = response.json()
                print(f"Proxy test succeeded, IP: {data.get('origin')}")
                return True
            else:
                print(f"Proxy test failed, status code: {response.status_code}")
                return False

        except Exception as e:
            print(f"Proxy test exception: {e}")
            return False

def demonstrate_proxy_usage():
    """
    Demonstrate proxy usage
    """
    print("=== Proxy usage demo ===")

    # Request without a proxy
    try:
        response = requests.get("https://httpbin.org/ip", timeout=10)
        if response.status_code == 200:
            data = response.json()
            print(f"Direct-access IP: {data.get('origin')}")
    except Exception as e:
        print(f"Direct access failed: {e}")

    # Requests through a proxy (illustrative)
    proxy_manager = ProxyManager()

    # Note: the code below needs working proxy servers to actually succeed
    print("\nProxy tests (require working proxies):")
    for i in range(3):
        proxy = proxy_manager.get_proxy()
        print(f"Testing proxy {i+1}: {proxy}")

        # In a real environment, test the proxy:
        # is_working = proxy_manager.test_proxy(proxy)
        # if not is_working:
        #     proxy_manager.mark_proxy_failed(proxy)

# Fetching free proxies (illustrative)
def get_free_proxies():
    """
    Fetch free proxies (illustrative)
    """
    print("\n=== Fetching free proxies ===")

    # This only sketches the structure; in practice you scrape proxy sites
    free_proxy_sources = [
        "https://www.proxy-list.download/api/v1/get?type=http",
        "https://api.proxyscrape.com/v2/?request=get&protocol=http",
    ]

    proxies = []

    for source in free_proxy_sources:
        try:
            print(f"Fetching proxies from {source}...")
            # A real implementation must parse each site's format
            # response = requests.get(source, timeout=10)
            # parse the proxy list...
            print("Proxy fetch complete (illustrative)")

        except Exception as e:
            print(f"Failed to fetch proxies: {e}")

    return proxies

# Run the proxy demo
if __name__ == "__main__":
    demonstrate_proxy_usage()
    get_free_proxies()

Setting the User-Agent

The User-Agent string identifies the client application; setting a plausible User-Agent helps avoid being flagged as a crawler:

import requests
import random

class UserAgentManager:
    """
    User-Agent manager
    """
    def __init__(self):
        self.user_agents = [
            # Chrome
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',

            # Firefox
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',

            # Safari
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
            'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',

            # Edge
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
        ]

    def get_random_user_agent(self):
        """
        Return a random User-Agent
        """
        return random.choice(self.user_agents)

    def get_mobile_user_agent(self):
        """
        Return a mobile User-Agent
        """
        mobile_agents = [
            'Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1',
            'Mozilla/5.0 (Android 14; Mobile; rv:121.0) Gecko/121.0 Firefox/121.0',
            'Mozilla/5.0 (Linux; Android 14; SM-G998B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
        ]
        return random.choice(mobile_agents)

def demonstrate_user_agent():
    """
    Demonstrate User-Agent usage
    """
    print("=== User-Agent demo ===")

    ua_manager = UserAgentManager()

    # Try different User-Agents
    test_url = "https://httpbin.org/user-agent"

    for i in range(3):
        user_agent = ua_manager.get_random_user_agent()
        headers = {'User-Agent': user_agent}

        try:
            response = requests.get(test_url, headers=headers)
            if response.status_code == 200:
                data = response.json()
                print(f"\nRequest {i+1}:")
                print(f"User-Agent sent: {user_agent[:50]}...")
                print(f"Received by the server: {data.get('user-agent', '')[:50]}...")
        except Exception as e:
            print(f"Request failed: {e}")

    # Try a mobile User-Agent
    print("\n=== Mobile User-Agent ===")
    mobile_ua = ua_manager.get_mobile_user_agent()
    headers = {'User-Agent': mobile_ua}

    try:
        response = requests.get(test_url, headers=headers)
        if response.status_code == 200:
            data = response.json()
            print(f"Mobile User-Agent: {data.get('user-agent')}")
    except Exception as e:
        print(f"Mobile request failed: {e}")

# Run the User-Agent demo
if __name__ == "__main__":
    demonstrate_user_agent()

Debugging and Testing Tools

Effective debugging and testing tools help you locate and fix problems in crawler development quickly:

import requests
import time
import logging
from functools import wraps

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('crawler.log'),
        logging.StreamHandler()
    ]
)

def debug_request(func):
    """
    Decorator that logs the timing and outcome of a request function
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()

        try:
            result = func(*args, **kwargs)
            end_time = time.time()

            logging.info(f"{func.__name__} succeeded in {end_time - start_time:.3f}s")
            return result

        except Exception as e:
            end_time = time.time()
            logging.error(f"{func.__name__} failed after {end_time - start_time:.3f}s, error: {e}")
            raise

    return wrapper

class CrawlerDebugger:
    """
    Crawler debugger
    """
    def __init__(self):
        self.request_count = 0
        self.success_count = 0
        self.error_count = 0
        self.start_time = time.time()

    @debug_request
    def debug_get(self, url, **kwargs):
        """
        Debug-instrumented GET request
        """
        self.request_count += 1

        # Default headers
        default_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        # Merge headers so that caller-supplied values take precedence
        # (updating the caller's dict with the defaults would silently
        # overwrite the caller's own headers)
        kwargs['headers'] = {**default_headers, **kwargs.get('headers', {})}

        logging.info(f"Sending GET request to: {url}")
        logging.debug(f"Request kwargs: {kwargs}")

        try:
            response = requests.get(url, **kwargs)

            logging.info(f"Response status code: {response.status_code}")
            logging.info(f"Response time: {response.elapsed.total_seconds():.3f}s")
            logging.debug(f"Response headers: {dict(response.headers)}")

            if response.status_code == 200:
                self.success_count += 1
            else:
                self.error_count += 1
                logging.warning(f"Non-200 status code: {response.status_code}")

            return response

        except requests.RequestException as e:
            self.error_count += 1
            logging.error(f"Request exception: {e}")
            raise

    def get_stats(self):
        """
        Collect statistics
        """
        elapsed_time = time.time() - self.start_time

        stats = {
            'Total requests': self.request_count,
            'Successful requests': self.success_count,
            'Failed requests': self.error_count,
            'Success rate': f"{(self.success_count / max(self.request_count, 1)) * 100:.2f}%",
            'Elapsed time': f"{elapsed_time:.2f}s",
            'Average request rate': f"{self.request_count / max(elapsed_time, 1):.2f} requests/sec"
        }

        return stats

    def print_stats(self):
        """
        Print the statistics
        """
        stats = self.get_stats()

        print("\n=== Crawler statistics ===")
        for key, value in stats.items():
            print(f"{key}: {value}")

def test_crawler_debugger():
    """
    Exercise the crawler debugger
    """
    debugger = CrawlerDebugger()

    test_urls = [
        "https://httpbin.org/get",
        "https://httpbin.org/status/200",
        "https://httpbin.org/delay/1",
        "https://httpbin.org/status/404",  # this one returns 404
        "https://httpbin.org/json",
    ]

    print("Starting crawler-debugger test...")

    for url in test_urls:
        try:
            response = debugger.debug_get(url, timeout=10)
            print(f"✓ {url} - status code: {response.status_code}")
        except Exception as e:
            print(f"✗ {url} - error: {e}")

        time.sleep(0.5)  # avoid sending requests too quickly

    # Print the statistics
    debugger.print_stats()

# Performance-testing helper
def performance_test(func, *args, **kwargs):
    """
    Decorator that returns a runner which times repeated calls of func
    """
    def test_performance(iterations=10):
        times = []

        for i in range(iterations):
            start_time = time.time()
            try:
                func(*args, **kwargs)
                end_time = time.time()
                times.append(end_time - start_time)
            except Exception as e:
                print(f"Iteration {i+1} failed: {e}")

        if times:
            avg_time = sum(times) / len(times)
            min_time = min(times)
            max_time = max(times)

            print(f"\n=== Performance test results ({iterations} runs) ===")
            print(f"Average time: {avg_time:.3f}s")
            print(f"Fastest: {min_time:.3f}s")
            print(f"Slowest: {max_time:.3f}s")
            print(f"Success rate: {len(times)}/{iterations} ({len(times)/iterations*100:.1f}%)")

    return test_performance

# Run the debugging demo
if __name__ == "__main__":
    test_crawler_debugger()

    # Performance-test example
    @performance_test
    def simple_request():
        response = requests.get("https://httpbin.org/get", timeout=5)
        return response.status_code == 200

    print("\nStarting performance test...")
    simple_request(iterations=5)

Sample output:

Starting crawler-debugger test...
2024-01-15 14:30:15,123 - INFO - Sending GET request to: https://httpbin.org/get
2024-01-15 14:30:15,456 - INFO - Response status code: 200
2024-01-15 14:30:15,456 - INFO - Response time: 0.333s
2024-01-15 14:30:15,456 - INFO - debug_get succeeded in 0.334s
✓ https://httpbin.org/get - status code: 200

2024-01-15 14:30:16,001 - INFO - Sending GET request to: https://httpbin.org/status/200
2024-01-15 14:30:16,234 - INFO - Response status code: 200
2024-01-15 14:30:16,234 - INFO - Response time: 0.233s
2024-01-15 14:30:16,234 - INFO - debug_get succeeded in 0.234s
✓ https://httpbin.org/status/200 - status code: 200

=== Crawler statistics ===
Total requests: 5
Successful requests: 4
Failed requests: 1
Success rate: 80.00%
Elapsed time: 3.45s
Average request rate: 1.45 requests/sec

=== Performance test results (5 runs) ===
Average time: 0.456s
Fastest: 0.234s
Slowest: 0.678s
Success rate: 5/5 (100.0%)

14.2 Network Requests with the Requests Library

Requests is the most popular HTTP library for Python, making HTTP requests simple and elegant. Compared with urllib in the standard library, Requests provides a far more human-friendly API, which makes it the tool of choice for web crawler development.
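To make the comparison with urllib concrete, here is a small side-by-side sketch (not from the original text) that performs the same GET request both ways; notice how much boilerplate the standard-library version needs:

import json
from urllib.request import Request, urlopen

import requests

url = 'https://httpbin.org/get'

# Standard library: build a Request object, decode bytes, parse JSON by hand
req = Request(url, headers={'User-Agent': 'MySpider/1.0'})
with urlopen(req, timeout=10) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(data['url'])

# Requests: one call, with decoding and JSON parsing built in
data = requests.get(url, headers={'User-Agent': 'MySpider/1.0'}, timeout=10).json()
print(data['url'])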

Requests Basics

Installation and Basic Usage

Installing the Requests library is straightforward with pip:

pip install requests
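A quick way to confirm the install succeeded is to print the library's version (the exact version number on your machine will differ):

import requests
print(requests.__version__)  # e.g. 2.31.0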

Once installed, let's look at the basics of using Requests:

import requests

def basic_requests_usage():
    """
    Demonstrate the basics of using Requests
    """
    print("=== Requests basics demo ===")

    # 1. The simplest GET request
    print("\n1. Basic GET request:")
    response = requests.get('https://httpbin.org/get')

    print(f"Status code: {response.status_code}")
    print(f"Elapsed time: {response.elapsed.total_seconds():.3f}s")
    print(f"Content type: {response.headers.get('content-type')}")

    # 2. Check whether the request succeeded
    if response.status_code == 200:
        print("Request succeeded!")
        data = response.json()  # parse the JSON response
        print(f"URL received by the server: {data['url']}")
    else:
        print(f"Request failed with status code: {response.status_code}")

    # 3. Check the status with raise_for_status()
    try:
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        print("Status check passed")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")

    # 4. Different ways to read the response body
    print("\n2. Retrieving the response body:")

    # Text content
    print(f"Text length: {len(response.text)} characters")

    # Binary content
    print(f"Binary length: {len(response.content)} bytes")

    # JSON content (if the body is JSON)
    try:
        json_data = response.json()
        print(f"JSON keys: {list(json_data.keys())}")
    except ValueError:
        print("The response is not valid JSON")

    # 5. Response header information
    print("\n3. Response headers:")
    print(f"Server: {response.headers.get('server', 'unknown')}")
    print(f"Content-Length: {response.headers.get('content-length', 'unknown')}")
    print(f"Connection: {response.headers.get('connection', 'unknown')}")

# Run the basic demo
if __name__ == "__main__":
    basic_requests_usage()

Output:

=== Requests basics demo ===

1. Basic GET request:
Status code: 200
Elapsed time: 0.234s
Content type: application/json
Request succeeded!
URL received by the server: https://httpbin.org/get
Status check passed

2. Retrieving the response body:
Text length: 312 characters
Binary length: 312 bytes
JSON keys: ['args', 'headers', 'origin', 'url']

3. Response headers:
Server: gunicorn/19.9.0
Content-Length: 312
Connection: keep-alive

GET and POST Requests

GET and POST are the two most common HTTP request methods: GET retrieves data, POST submits data.

GET requests in detail:

import requests

def demonstrate_get_requests():
    """
    Demonstrate different kinds of GET requests
    """
    print("=== GET requests in detail ===")

    # 1. Basic GET request
    print("\n1. Basic GET request:")
    response = requests.get('https://httpbin.org/get')
    print(f"Request URL: {response.url}")
    print(f"Status code: {response.status_code}")

    # 2. GET request with parameters
    print("\n2. GET request with parameters:")

    # Option 1: the params argument
    params = {
        'name': '張三',             # non-ASCII values are URL-encoded automatically
        'age': 25,
        'city': '北京',
        'hobbies': ['讀書', '游泳']  # list values become repeated parameters
    }

    response = requests.get('https://httpbin.org/get', params=params)
    print(f"Constructed URL: {response.url}")

    data = response.json()
    print(f"Parameters received by the server: {data['args']}")

    # Option 2: embed parameters directly in the URL
    url_with_params = 'https://httpbin.org/get?name=李四&age=30'
    response2 = requests.get(url_with_params)
    print(f"\nInline URL parameters: {response2.json()['args']}")

    # 3. Custom request headers
    print("\n3. Custom request headers:")
    headers = {
        'User-Agent': 'MySpider/1.0',
        'Accept': 'application/json',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Referer': 'https://www.example.com'
    }

    response = requests.get('https://httpbin.org/get', headers=headers)
    received_headers = response.json()['headers']

    print(f"User-Agent sent: {headers['User-Agent']}")
    print(f"User-Agent received by the server: {received_headers.get('User-Agent')}")

    # 4. Timeouts
    print("\n4. Timeouts:")
    try:
        # connect timeout of 3 seconds, read timeout of 5 seconds
        response = requests.get('https://httpbin.org/delay/2', timeout=(3, 5))
        print(f"Request succeeded in {response.elapsed.total_seconds():.3f}s")
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.RequestException as e:
        print(f"Request exception: {e}")

    # 5. Handling redirects
    print("\n5. Redirect handling:")

    # Follow redirects (the default behavior)
    response = requests.get('https://httpbin.org/redirect/2')
    print(f"Final URL: {response.url}")
    print(f"Redirect history: {[r.url for r in response.history]}")

    # Disable redirects
    response_no_redirect = requests.get('https://httpbin.org/redirect/1', allow_redirects=False)
    print(f"\nStatus code with redirects disabled: {response_no_redirect.status_code}")
    print(f"Location header: {response_no_redirect.headers.get('Location')}")

# Run the GET demo
if __name__ == "__main__":
    demonstrate_get_requests()

POST requests in detail:

import requests
import json

def demonstrate_post_requests():
    """
    Demonstrate different kinds of POST requests
    """
    print("=== POST requests in detail ===")

    # 1. Sending form data
    print("\n1. Sending form data:")
    form_data = {
        'username': 'testuser',
        'password': 'testpass',
        'email': 'test@example.com',
        'remember': 'on'
    }

    response = requests.post('https://httpbin.org/post', data=form_data)

    if response.status_code == 200:
        result = response.json()
        print(f"Form data sent: {form_data}")
        print(f"Form received by the server: {result['form']}")
        print(f"Content-Type: {result['headers'].get('Content-Type')}")

    # 2. Sending JSON data
    print("\n2. Sending JSON data:")
    json_data = {
        'name': '王五',
        'age': 28,
        'skills': ['Python', 'JavaScript', 'SQL'],
        'is_active': True,
        'profile': {
            'city': '上海',
            'experience': 5
        }
    }

    # Option 1: the json parameter (recommended)
    response = requests.post('https://httpbin.org/post', json=json_data)

    if response.status_code == 200:
        result = response.json()
        print(f"JSON sent: {json_data}")
        print(f"JSON received by the server: {result['json']}")
        print(f"Content-Type: {result['headers'].get('Content-Type')}")

    # Option 2: set the headers and data manually
    headers = {'Content-Type': 'application/json'}
    response2 = requests.post(
        'https://httpbin.org/post', 
        data=json.dumps(json_data), 
        headers=headers
    )
    print(f"\nStatus code with the manual approach: {response2.status_code}")

    # 3. Uploading files
    print("\n3. File upload:")

    # Create a temporary file for the demo
    import tempfile
    import os

    # The Chinese text deliberately exercises non-ASCII file content
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False,
                                     encoding='utf-8') as f:
        f.write("這是一個測試文件\n包含中文內容")
        temp_file_path = f.name

    try:
        # Upload the file
        with open(temp_file_path, 'rb') as f:
            files = {'file': ('test.txt', f, 'text/plain')}
            response = requests.post('https://httpbin.org/post', files=files)

        if response.status_code == 200:
            result = response.json()
            print(f"Uploaded file info: {result['files']}")
            print(f"Content-Type: {result['headers'].get('Content-Type')}")

    finally:
        # Clean up the temporary file
        os.unlink(temp_file_path)

    # 4. Mixed submissions
    print("\n4. Mixed submission:")

    # Send form fields and a file together
    form_data = {'description': 'file description', 'category': 'test'}

    # Create an in-memory file object
    from io import BytesIO

    file_content = BytesIO(b"Hello, World! This is a test file.")
    files = {'upload': ('hello.txt', file_content, 'text/plain')}

    response = requests.post(
        'https://httpbin.org/post', 
        data=form_data, 
        files=files
    )

    if response.status_code == 200:
        result = response.json()
        print(f"Form data: {result['form']}")
        print(f"Files: {list(result['files'].keys())}")

    # 5. POST with custom headers
    print("\n5. POST with custom headers:")

    headers = {
        'User-Agent': 'MyApp/2.0',
        'Authorization': 'Bearer your-token-here',
        'X-Custom-Header': 'custom-value'
    }

    data = {'message': 'Hello from custom headers'}

    response = requests.post(
        'https://httpbin.org/post', 
        json=data, 
        headers=headers
    )

    if response.status_code == 200:
        result = response.json()
        received_headers = result['headers']
        print(f"Custom header X-Custom-Header: {received_headers.get('X-Custom-Header')}")
        print(f"Authorization: {received_headers.get('Authorization')}")

# Run the POST demo
if __name__ == "__main__":
    demonstrate_post_requests()

Sample output:

=== POST requests in detail ===

1. Sending form data:
Form data sent: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
Form received by the server: {'username': 'testuser', 'password': 'testpass', 'email': 'test@example.com', 'remember': 'on'}
Content-Type: application/x-www-form-urlencoded

2. Sending JSON data:
JSON sent: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
JSON received by the server: {'name': '王五', 'age': 28, 'skills': ['Python', 'JavaScript', 'SQL'], 'is_active': True, 'profile': {'city': '上海', 'experience': 5}}
Content-Type: application/json

3. File upload:
Uploaded file info: {'file': '這是一個測試文件\n包含中文內容'}
Content-Type: multipart/form-data; boundary=...

4. Mixed submission:
Form data: {'description': 'file description', 'category': 'test'}
Files: ['upload']

5. POST with custom headers:
Custom header X-Custom-Header: custom-value
Authorization: Bearer your-token-here

Request Parameters and Headers

In web crawling, setting request parameters and headers correctly matters a great deal: they determine how the server handles your request.

Request parameters in detail:

import requests
from urllib.parse import quote

def advanced_parameters_demo():
    """
    Demonstrate advanced parameter handling
    """
    print("=== Advanced parameter handling ===")

    # 1. Complex parameter structures
    print("\n1. Complex parameter structures:")

    complex_params = {
        'q': 'Python爬蟲',  # a Chinese search term, to show URL encoding
        'page': 1,
        'size': 20,
        'sort': ['time', 'relevance'],  # multi-valued parameter
        'filters': {
            # note: requests does not serialize nested dicts — iterating a
            # dict yields only its keys, so flatten nested structures
            # yourself for real APIs
            'category': 'tech',
            'date_range': '2024-01-01,2024-12-31'
        },
        'include_fields': ['title', 'content', 'author'],
        'exclude_empty': True
    }

    # Requests URL-encodes these parameters automatically
    response = requests.get('https://httpbin.org/get', params=complex_params)

    print(f"Constructed URL: {response.url}")

    result = response.json()
    print(f"\nParameters received by the server:")
    for key, value in result['args'].items():
        print(f"  {key}: {value}")

    # 2. Manual URL encoding
    print("\n2. Manual URL encoding:")

    # Handling special characters
    special_params = {
        'query': 'hello world & python',
        'symbols': '!@#$%^&*()+={}[]|\\:;"<>?,./'
    }

    # Option 1: let requests encode automatically
    response1 = requests.get('https://httpbin.org/get', params=special_params)
    print(f"Automatically encoded URL: {response1.url}")

    # Option 2: encode by hand
    encoded_query = quote('hello world & python')
    manual_url = f'https://httpbin.org/get?query={encoded_query}'
    response2 = requests.get(manual_url)
    print(f"Manually encoded URL: {response2.url}")

    # 3. Different ways to pass array parameters
    print("\n3. Array parameters:")

    # Option 1: a Python list (the default behavior)
    list_params = {'tags': ['python', 'web', 'crawler']}
    response = requests.get('https://httpbin.org/get', params=list_params)
    print(f"List parameter URL: {response.url}")

    # Option 2: build the repeated parameters by hand
    manual_params = [('tags', 'python'), ('tags', 'web'), ('tags', 'crawler')]
    response2 = requests.get('https://httpbin.org/get', params=manual_params)
    print(f"Manually repeated parameter URL: {response2.url}")

    # 4. Conditional parameter construction
    print("\n4. Conditional parameter construction:")

    def build_search_params(keyword, page=1, filters=None, sort_by=None):
        """
        Build search parameters based on conditions
        """
        params = {'q': keyword, 'page': page}

        if filters:
            for key, value in filters.items():
                if value:  # only add non-empty values
                    params[f'filter_{key}'] = value

        if sort_by:
            params['sort'] = sort_by

        return params

    # Build parameters conditionally
    search_filters = {
        'category': 'technology',
        'author': '',  # empty, will not be added
        'date': '2024-01-01'
    }

    params = build_search_params(
        keyword='Python教程',
        page=2,
        filters=search_filters,
        sort_by='date_desc'
    )

    response = requests.get('https://httpbin.org/get', params=params)
    print(f"Conditionally built parameters: {response.json()['args']}")

# Run the parameters demo
if __name__ == "__main__":
    advanced_parameters_demo()
Request headers in detail:

import requests
import time
import random

def advanced_headers_demo():
    """
    Demonstrate advanced request-header handling
    """
    print("=== Advanced request headers ===")

    # 1. Full browser-header emulation
    print("\n1. Full browser-header emulation:")

    browser_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',  # Do Not Track
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0'
    }

    response = requests.get('https://httpbin.org/get', headers=browser_headers)
    received_headers = response.json()['headers']

    print(f"User-Agent sent: {browser_headers['User-Agent'][:50]}...")
    print(f"User-Agent received by the server: {received_headers.get('User-Agent', '')[:50]}...")
    print(f"Accept-Language: {received_headers.get('Accept-Language')}")

    # 2. API request headers
    print("\n2. API request headers:")

    api_headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...',
        'X-API-Key': 'your-api-key-here',
        'X-Client-Version': '1.2.3',
        'X-Request-ID': f'req_{int(time.time())}_{random.randint(1000, 9999)}'
    }

    data = {'query': 'test data'}
    response = requests.post('https://httpbin.org/post', json=data, headers=api_headers)

    if response.status_code == 200:
        result = response.json()
        print(f"API request succeeded")
        print(f"Request ID: {result['headers'].get('X-Request-ID')}")
        print(f"Authorization: {result['headers'].get('Authorization', '')[:20]}...")

    # 3. Anti-bot header setup
    print("\n3. Anti-bot header setup:")

    # Mimic real browser behavior
    anti_bot_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Referer': 'https://www.google.com/',  # pretend we arrived from a search engine
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }

    response = requests.get('https://httpbin.org/get', headers=anti_bot_headers)
    print(f"Anti-bot request status: {response.status_code}")
    print(f"Referer header: {response.json()['headers'].get('Referer')}")

    # 4. Dynamic header generation
    print("\n4. Dynamic header generation:")

    def generate_dynamic_headers():
        """
        Generate a randomized set of request headers
        """
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0'
        ]

        referers = [
            'https://www.google.com/',
            'https://www.bing.com/',
            'https://www.baidu.com/',
            'https://duckduckgo.com/'
        ]

        return {
            'User-Agent': random.choice(user_agents),
            'Referer': random.choice(referers),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'X-Forwarded-For': f'{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}'
        }

    # Send several requests with dynamically generated headers
    for i in range(3):
        headers = generate_dynamic_headers()
        response = requests.get('https://httpbin.org/get', headers=headers)

        if response.status_code == 200:
            result = response.json()
            print(f"\nRequest {i+1}:")
            print(f"  User-Agent: {result['headers'].get('User-Agent', '')[:40]}...")
            print(f"  Referer: {result['headers'].get('Referer')}")
            print(f"  X-Forwarded-For: {result['headers'].get('X-Forwarded-For')}")

    # 5. Header precedence and overrides
    print("\n5. Header precedence demo:")

    # Create a session with default headers
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'DefaultAgent/1.0',
        'Accept': 'application/json',
        'X-Default-Header': 'default-value'
    })

    # Override some headers per request
    override_headers = {
        'User-Agent': 'OverrideAgent/2.0',  # overrides the session default
        'X-Custom-Header': 'custom-value'   # adds a new header
    }

    response = session.get('https://httpbin.org/get', headers=override_headers)

    if response.status_code == 200:
        result = response.json()
        headers = result['headers']
        print(f"Effective User-Agent: {headers.get('User-Agent')}")
        print(f"Default Accept: {headers.get('Accept')}")
        print(f"Default header: {headers.get('X-Default-Header')}")
        print(f"Custom header: {headers.get('X-Custom-Header')}")

# Run the headers demo
if __name__ == "__main__":
    advanced_headers_demo()

Handling Response Objects

The response object contains everything the server returned; handling it correctly is a core crawler-development skill.

import requests

def response_handling_demo():
    """
    演示響應對象的各種處理方法
    """
    print("=== 響應對象處理演示 ===")

    # 發送一個測試請求
    response = requests.get('https://httpbin.org/json')

    # 1. 基本響應信息
    print("\n1. 基本響應信息:")
    print(f"狀態碼: {response.status_code}")
    print(f"狀態描述: {response.reason}")
    print(f"請求URL: {response.url}")
    print(f"響應時間: {response.elapsed.total_seconds():.3f}秒")
    print(f"編碼: {response.encoding}")

    # 2. 響應頭詳細分析
    print("\n2. 響應頭分析:")
    print(f"Content-Type: {response.headers.get('content-type')}")
    print(f"Content-Length: {response.headers.get('content-length')}")
    print(f"Server: {response.headers.get('server')}")
    print(f"Date: {response.headers.get('date')}")

    # 檢查是否支持壓縮
    content_encoding = response.headers.get('content-encoding')
    if content_encoding:
        print(f"內容編碼: {content_encoding}")
    else:
        print("未使用內容壓縮")

    # 3. 響應內容的不同獲取方式
    print("\n3. 響應內容獲取:")

    # 文本內容
    text_content = response.text
    print(f"文本內容長度: {len(text_content)}字符")
    print(f"文本內容預覽: {text_content[:100]}...")

    # 二進制內容
    binary_content = response.content
    print(f"二進制內容長度: {len(binary_content)}字節")

    # JSON內容
    try:
        json_content = response.json()
        print(f"JSON內容類型: {type(json_content)}")
        if isinstance(json_content, dict):
            print(f"JSON鍵: {list(json_content.keys())}")
    except ValueError as e:
        print(f"JSON解析失敗: {e}")

    # 4. 響應狀態檢查
    print("\n4. 響應狀態檢查:")

    def check_response_status(response):
        """
        檢查響應狀態的詳細信息
        """
        print(f"狀態碼: {response.status_code}")

        # 使用內置方法檢查狀態:status_code < 400 時 response.ok 爲 True
        if response.ok:
            print("✓ 請求成功 (狀態碼 < 400)")
        else:
            print("✗ 請求失敗")

        # 詳細狀態分類
        if 200 <= response.status_code < 300:
            print("✓ 成功響應")
        elif 300 <= response.status_code < 400:
            print("→ 重定向響應")
            location = response.headers.get('location')
            if location:
                print(f"  重定向到: {location}")
        elif 400 <= response.status_code < 500:
            print("✗ 客戶端錯誤")
        elif 500 <= response.status_code < 600:
            print("✗ 服務器錯誤")

        # 使用raise_for_status檢查
        try:
            response.raise_for_status()
            print("✓ 狀態檢查通過")
        except requests.exceptions.HTTPError as e:
            print(f"✗ 狀態檢查失敗: {e}")

    check_response_status(response)

    # 5. 測試不同狀態碼的響應
    print("\n5. 不同狀態碼測試:")

    test_urls = [
        ('https://httpbin.org/status/200', '成功'),
        ('https://httpbin.org/status/404', '未找到'),
        ('https://httpbin.org/status/500', '服務器錯誤'),
        ('https://httpbin.org/redirect/1', '重定向')
    ]

    for url, description in test_urls:
        try:
            resp = requests.get(url, timeout=5)
            print(f"\n{description} ({url}):")
            print(f"  狀態碼: {resp.status_code}")
            print(f"  最終URL: {resp.url}")
            if resp.history:
                print(f"  重定向歷史: {[r.status_code for r in resp.history]}")
        except requests.exceptions.RequestException as e:
            print(f"\n{description} 請求失敗: {e}")

    # 6. 響應內容類型處理
    print("\n6. 不同內容類型處理:")

    def handle_different_content_types():
        """
        處理不同類型的響應內容
        """
        # JSON響應
        json_resp = requests.get('https://httpbin.org/json')
        if json_resp.headers.get('content-type', '').startswith('application/json'):
            data = json_resp.json()
            print(f"JSON數據: {data}")

        # HTML響應
        html_resp = requests.get('https://httpbin.org/html')
        if 'text/html' in html_resp.headers.get('content-type', ''):
            print(f"HTML內容長度: {len(html_resp.text)}字符")
            # 可以使用BeautifulSoup進一步解析

        # XML響應
        xml_resp = requests.get('https://httpbin.org/xml')
        if 'application/xml' in xml_resp.headers.get('content-type', ''):
            print(f"XML內容長度: {len(xml_resp.text)}字符")

        # 圖片響應(二進制)
        try:
            img_resp = requests.get('https://httpbin.org/image/png', timeout=10)
            if img_resp.headers.get('content-type', '').startswith('image/'):
                print(f"圖片大小: {len(img_resp.content)}字節")
                print(f"圖片類型: {img_resp.headers.get('content-type')}")
        except requests.exceptions.RequestException:
            print("圖片請求失敗或超時")

    handle_different_content_types()

    # 7. 響應時間和性能分析
    print("\n7. 響應時間分析:")

    def analyze_response_performance(url, num_requests=3):
        """
        分析響應性能
        """
        times = []

        for i in range(num_requests):
            start_time = datetime.now()
            try:
                resp = requests.get(url, timeout=10)
                end_time = datetime.now()

                # 計算總時間
                total_time = (end_time - start_time).total_seconds()
                # 獲取requests內部計時
                elapsed_time = resp.elapsed.total_seconds()

                times.append({
                    'total': total_time,
                    'elapsed': elapsed_time,
                    'status': resp.status_code
                })

                print(f"請求 {i+1}: {elapsed_time:.3f}秒 (狀態碼: {resp.status_code})")

            except requests.exceptions.RequestException as e:
                print(f"請求 {i+1} 失敗: {e}")

        if times:
            avg_time = sum(t['elapsed'] for t in times) / len(times)
            min_time = min(t['elapsed'] for t in times)
            max_time = max(t['elapsed'] for t in times)

            print(f"\n性能統計:")
            print(f"  平均響應時間: {avg_time:.3f}秒")
            print(f"  最快響應時間: {min_time:.3f}秒")
            print(f"  最慢響應時間: {max_time:.3f}秒")

    analyze_response_performance('https://httpbin.org/delay/1')

# 運行響應處理演示
if __name__ == "__main__":
    response_handling_demo()

運行結果示例:

=== 響應對象處理演示 ===

1. 基本響應信息:
狀態碼: 200
狀態描述: OK
請求URL: https://httpbin.org/json
響應時間: 0.234秒
編碼: utf-8

2. 響應頭分析:
Content-Type: application/json
Content-Length: 429
Server: gunicorn/19.9.0
Date: Mon, 15 Jan 2024 06:30:15 GMT
未使用內容壓縮

3. 響應內容獲取:
文本內容長度: 429字符
文本內容預覽: {"slideshow": {"author": "Yours Truly", "date": "date of publication", "slides": [{"title": "Wake up to Wo...
二進制內容長度: 429字節
JSON內容類型: <class 'dict'>
JSON鍵: ['slideshow']

4. 響應狀態檢查:
狀態碼: 200
✓ 請求成功 (狀態碼 200-299)
✓ 成功響應
✓ 狀態檢查通過

5. 不同狀態碼測試:

成功 (https://httpbin.org/status/200):
  狀態碼: 200
  最終URL: https://httpbin.org/status/200

未找到 (https://httpbin.org/status/404):
  狀態碼: 404
  最終URL: https://httpbin.org/status/404

服務器錯誤 (https://httpbin.org/status/500):
  狀態碼: 500
  最終URL: https://httpbin.org/status/500

重定向 (https://httpbin.org/redirect/1):
  狀態碼: 200
  最終URL: https://httpbin.org/get
  重定向歷史: [302]

6. 不同內容類型處理:
(輸出從略)

7. 響應時間分析:
請求 1: 1.234秒 (狀態碼: 200)
請求 2: 1.156秒 (狀態碼: 200)
請求 3: 1.298秒 (狀態碼: 200)

性能統計:
  平均響應時間: 1.229秒
  最快響應時間: 1.156秒
  最慢響應時間: 1.298秒
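
順帶補充一點:response.text按響應頭聲明的編碼解碼;若服務器未在Content-Type中聲明charset,Requests對text/*類型會回退到ISO-8859-1,中文頁面常因此出現亂碼。下面是一個簡單的處理示意,用基於內容猜測的apparent_encoding作爲回退:

import requests

response = requests.get('https://httpbin.org/html')

# 服務器未聲明charset時,回退到基於字節內容猜測的編碼
if response.encoding is None or response.encoding.lower() == 'iso-8859-1':
    response.encoding = response.apparent_encoding

print(f"最終使用的編碼: {response.encoding}")
print(response.text[:80])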

高級功能

Session會話管理

Session對象允許你跨請求保持某些參數:同一個Session實例發出的所有請求會共享cookie,並且底層使用urllib3的連接池。因此,向同一主機發送多個請求時,底層的TCP連接會被複用,從而帶來顯著的性能提升。

import requests
import time
from datetime import datetime

def session_management_demo():
    """
    演示Session會話管理的各種功能
    """
    print("=== Session會話管理演示 ===")

    # 1. 基本Session使用
    print("\n1. 基本Session使用:")

    # 創建Session對象
    session = requests.Session()

    # 設置Session級別的請求頭
    session.headers.update({
        'User-Agent': 'MyApp/1.0',
        'Accept': 'application/json'
    })

    # 使用Session發送請求
    response1 = session.get('https://httpbin.org/get')
    print(f"第一次請求狀態碼: {response1.status_code}")
    print(f"User-Agent: {response1.json()['headers'].get('User-Agent')}")

    # Session會保持設置的頭部
    response2 = session.get('https://httpbin.org/headers')
    print(f"第二次請求User-Agent: {response2.json()['headers'].get('User-Agent')}")

    # 2. Cookie持久化
    print("\n2. Cookie持久化演示:")

    # 創建新的Session
    cookie_session = requests.Session()

    # 第一次請求設置cookie
    response = cookie_session.get('https://httpbin.org/cookies/set/session_id/abc123')
    print(f"設置Cookie後的狀態碼: {response.status_code}")

    # 查看Session中的cookies
    print(f"Session中的Cookies: {dict(cookie_session.cookies)}")

    # 第二次請求會自動攜帶cookie
    response = cookie_session.get('https://httpbin.org/cookies')
    cookies_data = response.json()
    print(f"服務器接收到的Cookies: {cookies_data.get('cookies', {})}")

    # 3. 連接池和性能優化
    print("\n3. 連接池性能對比:")

    def test_without_session(num_requests=5):
        """不使用Session的請求"""
        start_time = time.time()
        for i in range(num_requests):
            response = requests.get('https://httpbin.org/get')
            if response.status_code != 200:
                print(f"請求 {i+1} 失敗")
        end_time = time.time()
        return end_time - start_time

    def test_with_session(num_requests=5):
        """使用Session的請求"""
        start_time = time.time()
        session = requests.Session()
        for i in range(num_requests):
            response = session.get('https://httpbin.org/get')
            if response.status_code != 200:
                print(f"請求 {i+1} 失敗")
        session.close()
        end_time = time.time()
        return end_time - start_time

    print("\n性能測試 (5次請求):")
    time_without_session = test_without_session()
    time_with_session = test_with_session()

    print(f"不使用Session: {time_without_session:.3f}秒")
    print(f"使用Session: {time_with_session:.3f}秒")
    print(f"性能提升: {((time_without_session - time_with_session) / time_without_session * 100):.1f}%")

    # 4. Session配置和自定義
    print("\n4. Session配置:")

    # 創建自定義配置的Session
    custom_session = requests.Session()

    # 注意:requests.Session並沒有內建的默認超時屬性,
    # 給session.timeout賦值不會生效,超時應在每次請求時通過timeout參數傳入

    # 設置默認參數
    custom_session.params = {'api_key': 'your-api-key'}

    # 設置默認頭部
    custom_session.headers.update({
        'User-Agent': 'CustomBot/2.0',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive'
    })

    # 發送請求
    response = custom_session.get('https://httpbin.org/get', params={'extra': 'param'}, timeout=10)

    if response.status_code == 200:
        data = response.json()
        print(f"最終URL: {response.url}")
        print(f"合併後的參數: {data.get('args', {})}")
        print(f"請求頭: {data.get('headers', {}).get('User-Agent')}")

    # 5. Session的請求鉤子
    print("\n5. 請求鉤子演示:")

    def log_request_hook(response, *args, **kwargs):
        """請求日誌鉤子"""
        print(f"[鉤子] 請求: {response.request.method} {response.url}")
        print(f"[鉤子] 狀態碼: {response.status_code}")
        print(f"[鉤子] 響應時間: {response.elapsed.total_seconds():.3f}秒")

    # 創建帶鉤子的Session
    hook_session = requests.Session()
    hook_session.hooks['response'].append(log_request_hook)

    # 發送請求,鉤子會自動執行
    print("\n發送帶鉤子的請求:")
    response = hook_session.get('https://httpbin.org/delay/1')

    # 6. Session上下文管理
    print("\n6. Session上下文管理:")

    # 使用with語句自動管理Session生命週期
    with requests.Session() as s:
        s.headers.update({'User-Agent': 'ContextManager/1.0'})

        response = s.get('https://httpbin.org/get')
        print(f"上下文管理器請求狀態: {response.status_code}")
        print(f"User-Agent: {response.json()['headers'].get('User-Agent')}")
    # Session會自動關閉

    # 7. Session錯誤處理
    print("\n7. Session錯誤處理:")

    error_session = requests.Session()

    # 設置重試適配器
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry_strategy = Retry(
        total=3,  # 總重試次數
        backoff_factor=1,  # 重試間隔
        status_forcelist=[429, 500, 502, 503, 504],  # 需要重試的狀態碼
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    error_session.mount("http://", adapter)
    error_session.mount("https://", adapter)

    try:
        # 測試重試機制
        response = error_session.get('https://httpbin.org/status/500', timeout=5)
        print(f"重試後狀態碼: {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"請求最終失敗: {e}")

    # 8. Session狀態管理
    print("\n8. Session狀態管理:")

    state_session = requests.Session()

    # 模擬登錄流程
    login_data = {
        'username': 'testuser',
        'password': 'testpass'
    }

    # 第一步:獲取登錄頁面(可能包含CSRF token)
    login_page = state_session.get('https://httpbin.org/get')
    print(f"獲取登錄頁面: {login_page.status_code}")

    # 第二步:提交登錄信息
    login_response = state_session.post('https://httpbin.org/post', data=login_data)
    print(f"登錄請求: {login_response.status_code}")

    # 第三步:訪問需要認證的頁面
    protected_response = state_session.get('https://httpbin.org/get')
    print(f"訪問受保護頁面: {protected_response.status_code}")

    # Session會自動維護整個會話狀態
    print(f"會話中的Cookie數量: {len(state_session.cookies)}")

# 運行Session演示
if __name__ == "__main__":
    session_management_demo()
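
除了自動複用連接,Session的連接池大小也可以按需調整。HTTPAdapter默認每個主機只緩存10個連接,在多線程高併發場景下可以適當調大。下面是一個簡單示意:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# pool_connections: 緩存連接池的主機數量
# pool_maxsize: 每個主機連接池中可複用的最大連接數
adapter = HTTPAdapter(pool_connections=20, pool_maxsize=50)
session.mount('http://', adapter)
session.mount('https://', adapter)

for _ in range(3):
    response = session.get('https://httpbin.org/get', timeout=10)
    print(response.status_code)

session.close()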

身份驗證

Requests支持多種身份驗證方式,包括基本認證、摘要認證、OAuth等。

import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
import base64
import hashlib
import time

def authentication_demo():
    """
    演示各種身份驗證方式
    """
    print("=== 身份驗證演示 ===")

    # 1. HTTP基本認證 (Basic Authentication)
    print("\n1. HTTP基本認證:")

    # 方法1: 使用auth參數
    response = requests.get(
        'https://httpbin.org/basic-auth/user/pass',
        auth=('user', 'pass')
    )
    print(f"基本認證狀態碼: {response.status_code}")
    if response.status_code == 200:
        print(f"認證成功: {response.json()}")

    # 方法2: 使用HTTPBasicAuth類
    response2 = requests.get(
        'https://httpbin.org/basic-auth/testuser/testpass',
        auth=HTTPBasicAuth('testuser', 'testpass')
    )
    print(f"HTTPBasicAuth狀態碼: {response2.status_code}")

    # 方法3: 手動設置Authorization頭
    credentials = base64.b64encode(b'user:pass').decode('ascii')
    headers = {'Authorization': f'Basic {credentials}'}
    response3 = requests.get(
        'https://httpbin.org/basic-auth/user/pass',
        headers=headers
    )
    print(f"手動設置頭部狀態碼: {response3.status_code}")

    # 2. HTTP摘要認證 (Digest Authentication)
    print("\n2. HTTP摘要認證:")

    try:
        response = requests.get(
            'https://httpbin.org/digest-auth/auth/user/pass',
            auth=HTTPDigestAuth('user', 'pass')
        )
        print(f"摘要認證狀態碼: {response.status_code}")
        if response.status_code == 200:
            print(f"摘要認證成功: {response.json()}")
    except Exception as e:
        print(f"摘要認證失敗: {e}")

    # 3. Bearer Token認證
    print("\n3. Bearer Token認證:")

    # 模擬JWT token
    token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"

    headers = {'Authorization': f'Bearer {token}'}
    response = requests.get('https://httpbin.org/bearer', headers=headers)

    print(f"Bearer Token狀態碼: {response.status_code}")
    if response.status_code == 200:
        print(f"Token認證成功: {response.json()}")

    # 4. API Key認證
    print("\n4. API Key認證:")

    # 方法1: 在URL參數中
    api_key = "your-api-key-here"
    response = requests.get(
        'https://httpbin.org/get',
        params={'api_key': api_key}
    )
    print(f"URL參數API Key: {response.json()['args']}")

    # 方法2: 在請求頭中
    headers = {'X-API-Key': api_key}
    response2 = requests.get('https://httpbin.org/get', headers=headers)
    print(f"請求頭API Key: {response2.json()['headers'].get('X-Api-Key')}")

    # 5. 自定義認證類
    print("\n5. 自定義認證類:")

    class CustomAuth(requests.auth.AuthBase):
        """自定義認證類"""

        def __init__(self, api_key, secret_key):
            self.api_key = api_key
            self.secret_key = secret_key

        def __call__(self, r):
            # 生成時間戳
            timestamp = str(int(time.time()))

            # 生成簽名
            string_to_sign = f"{r.method}\n{r.url}\n{timestamp}"
            signature = hashlib.sha256(
                (string_to_sign + self.secret_key).encode('utf-8')
            ).hexdigest()

            # 添加認證頭
            r.headers['X-API-Key'] = self.api_key
            r.headers['X-Timestamp'] = timestamp
            r.headers['X-Signature'] = signature

            return r

    # 使用自定義認證
    custom_auth = CustomAuth('my-api-key', 'my-secret-key')
    response = requests.get('https://httpbin.org/get', auth=custom_auth)

    if response.status_code == 200:
        headers = response.json()['headers']
        print(f"自定義認證頭部:")
        print(f"  X-API-Key: {headers.get('X-Api-Key')}")
        print(f"  X-Timestamp: {headers.get('X-Timestamp')}")
        print(f"  X-Signature: {headers.get('X-Signature', '')[:20]}...")

    # 6. OAuth 2.0 模擬
    print("\n6. OAuth 2.0 模擬:")

    def oauth2_flow_simulation():
        """模擬OAuth 2.0授權流程"""

        # 第一步: 獲取授權碼 (實際應用中用戶會被重定向到授權服務器)
        auth_url = "https://httpbin.org/get"
        auth_params = {
            'response_type': 'code',
            'client_id': 'your-client-id',
            'redirect_uri': 'https://yourapp.com/callback',
            'scope': 'read write',
            'state': 'random-state-string'
        }

        print(f"授權URL: {auth_url}?{'&'.join([f'{k}={v}' for k, v in auth_params.items()])}")

        # 第二步: 使用授權碼獲取訪問令牌
        token_data = {
            'grant_type': 'authorization_code',
            'code': 'received-auth-code',
            'redirect_uri': 'https://yourapp.com/callback',
            'client_id': 'your-client-id',
            'client_secret': 'your-client-secret'
        }

        # 模擬獲取token
        token_response = requests.post('https://httpbin.org/post', data=token_data)
        print(f"Token請求狀態: {token_response.status_code}")

        # 第三步: 使用訪問令牌訪問API
        access_token = "mock-access-token-12345"
        api_headers = {'Authorization': f'Bearer {access_token}'}

        api_response = requests.get('https://httpbin.org/get', headers=api_headers)
        print(f"API訪問狀態: {api_response.status_code}")

        return access_token

    oauth_token = oauth2_flow_simulation()

    # 7. 會話級認證
    print("\n7. 會話級認證:")

    # 創建帶認證的Session
    auth_session = requests.Session()
    auth_session.auth = ('session_user', 'session_pass')

    # 所有通過這個Session的請求都會自動包含認證信息
    response1 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
    print(f"會話認證請求1: {response1.status_code}")

    response2 = auth_session.get('https://httpbin.org/basic-auth/session_user/session_pass')
    print(f"會話認證請求2: {response2.status_code}")

    # 8. 認證錯誤處理
    print("\n8. 認證錯誤處理:")

    def handle_auth_errors():
        """處理認證相關錯誤"""

        # 測試錯誤的認證信息
        try:
            response = requests.get(
                'https://httpbin.org/basic-auth/user/pass',
                auth=('wrong_user', 'wrong_pass'),
                timeout=5
            )

            if response.status_code == 401:
                print("✗ 認證失敗: 用戶名或密碼錯誤")
                print(f"  WWW-Authenticate: {response.headers.get('WWW-Authenticate')}")
            elif response.status_code == 403:
                print("✗ 訪問被拒絕: 權限不足")
            else:
                print(f"認證狀態: {response.status_code}")

        except requests.exceptions.RequestException as e:
            print(f"認證請求異常: {e}")

    handle_auth_errors()

# 運行認證演示
if __name__ == "__main__":
    authentication_demo()
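
實際項目中的Bearer token通常有有效期。下面給出一個基於AuthBase的簡化草圖,演示token臨近過期時自動刷新的思路;其中fetch_token是假設的取token回調,需按實際API自行實現:

import time
import requests

class RefreshingBearerAuth(requests.auth.AuthBase):
    """token過期前自動刷新的Bearer認證(簡化草圖)"""

    def __init__(self, fetch_token, lifetime=3600):
        self.fetch_token = fetch_token  # 假設的回調,返回新token字符串
        self.lifetime = lifetime
        self.token = None
        self.expires_at = 0

    def __call__(self, r):
        # 臨近過期時提前刷新,預留60秒餘量
        if self.token is None or time.time() > self.expires_at - 60:
            self.token = self.fetch_token()
            self.expires_at = time.time() + self.lifetime
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r

# 使用示例:這裏用固定字符串模擬取token的過程
auth = RefreshingBearerAuth(fetch_token=lambda: 'mock-token-123')
response = requests.get('https://httpbin.org/bearer', auth=auth)
print(f"自動刷新Bearer認證狀態碼: {response.status_code}")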

代理設置和SSL配置

在爬蟲開發中,代理和SSL配置是非常重要的功能,可以幫助我們繞過網絡限制和確保安全通信。

import requests
import ssl
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

def proxy_and_ssl_demo():
    """
    演示代理設置和SSL配置
    """
    print("=== 代理設置和SSL配置演示 ===")

    # 1. HTTP代理設置
    print("\n1. HTTP代理設置:")

    # 基本代理設置:字典的鍵是目標URL的協議,值是代理服務器自身的地址
    # (連接代理本身通常使用http://)
    proxies = {
        'http': 'http://proxy.example.com:8080',
        'https': 'http://proxy.example.com:8080'
    }

    # 注意:這裏使用示例代理,實際運行時需要替換爲真實代理
    print(f"配置的代理: {proxies}")

    # 帶認證的代理
    auth_proxies = {
        'http': 'http://username:password@proxy.example.com:8080',
        'https': 'http://username:password@proxy.example.com:8080'
    }
    print(f"帶認證的代理: {auth_proxies}")

    # 2. SOCKS代理設置
    print("\n2. SOCKS代理設置:")

    # 需要安裝: pip install requests[socks]
    socks_proxies = {
        'http': 'socks5://127.0.0.1:1080',
        'https': 'socks5://127.0.0.1:1080'
    }
    print(f"SOCKS代理配置: {socks_proxies}")

    # 3. 代理輪換
    print("\n3. 代理輪換演示:")

    import random

    proxy_list = [
        {'http': 'http://proxy1.example.com:8080', 'https': 'http://proxy1.example.com:8080'},
        {'http': 'http://proxy2.example.com:8080', 'https': 'http://proxy2.example.com:8080'},
        {'http': 'http://proxy3.example.com:8080', 'https': 'http://proxy3.example.com:8080'}
    ]

    def get_random_proxy():
        """獲取隨機代理"""
        return random.choice(proxy_list)

    # 模擬使用不同代理發送請求
    for i in range(3):
        proxy = get_random_proxy()
        print(f"請求 {i+1} 使用代理: {proxy['http']}")
        # 實際請求代碼:
        # response = requests.get('https://httpbin.org/ip', proxies=proxy, timeout=10)

    # 4. 代理驗證和測試
    print("\n4. 代理驗證:")

    def test_proxy(proxy_dict, test_url='https://httpbin.org/ip'):
        """測試代理是否可用"""
        try:
            response = requests.get(
                test_url,
                proxies=proxy_dict,
                timeout=10
            )

            if response.status_code == 200:
                ip_info = response.json()
                print(f"✓ 代理可用")
                print(f"  出口IP: {ip_info.get('origin')}")
                print(f"  響應時間: {response.elapsed.total_seconds():.3f}秒")
                return True
            else:
                print(f"✗ 代理響應異常: {response.status_code}")
                return False

        except requests.exceptions.ProxyError:
            print("✗ 代理連接失敗")
            return False
        except requests.exceptions.Timeout:
            print("✗ 代理連接超時")
            return False
        except requests.exceptions.RequestException as e:
            print(f"✗ 代理請求異常: {e}")
            return False

    # 測試直連(無代理)
    print("\n測試直連:")
    try:
        direct_response = requests.get('https://httpbin.org/ip', timeout=10)
        if direct_response.status_code == 200:
            ip_info = direct_response.json()
            print(f"✓ 直連成功")
            print(f"  本地IP: {ip_info.get('origin')}")
    except Exception as e:
        print(f"✗ 直連失敗: {e}")

    # 5. SSL配置
    print("\n5. SSL配置演示:")

    # 禁用SSL驗證(不推薦用於生產環境,且會觸發InsecureRequestWarning警告)
    print("\n禁用SSL驗證:")
    try:
        response = requests.get(
            'https://httpbin.org/get',
            verify=False  # 禁用SSL證書驗證
        )
        print(f"✓ 禁用SSL驗證請求成功: {response.status_code}")
    except Exception as e:
        print(f"✗ SSL請求失敗: {e}")

    # 自定義CA證書
    print("\n自定義CA證書:")
    # 指定CA證書文件路徑
    # response = requests.get('https://httpbin.org/get', verify='/path/to/ca-bundle.crt')
    print("可以通過verify參數指定CA證書文件路徑")

    # 客戶端證書認證
    print("\n客戶端證書認證:")
    # cert參數可以是證書文件路徑的字符串,或者是(cert, key)元組
    # response = requests.get('https://httpbin.org/get', cert=('/path/to/client.cert', '/path/to/client.key'))
    print("可以通過cert參數指定客戶端證書")

    # 6. 自定義SSL上下文
    print("\n6. 自定義SSL上下文:")

    class SSLAdapter(HTTPAdapter):
        """自定義SSL適配器"""

        def __init__(self, ssl_context=None, **kwargs):
            self.ssl_context = ssl_context
            super().__init__(**kwargs)

        def init_poolmanager(self, *args, **kwargs):
            kwargs['ssl_context'] = self.ssl_context
            return super().init_poolmanager(*args, **kwargs)

    # 創建自定義SSL上下文
    ssl_context = create_urllib3_context()
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE

    # 使用自定義SSL適配器
    session = requests.Session()
    session.mount('https://', SSLAdapter(ssl_context))

    try:
        response = session.get('https://httpbin.org/get')
        print(f"✓ 自定義SSL上下文請求成功: {response.status_code}")
    except Exception as e:
        print(f"✗ 自定義SSL請求失敗: {e}")

    # 7. 綜合配置示例
    print("\n7. 綜合配置示例:")

    def create_secure_session(proxy=None, verify_ssl=True, client_cert=None):
        """創建安全配置的Session"""
        session = requests.Session()

        # 設置代理
        if proxy:
            session.proxies.update(proxy)

        # SSL配置
        session.verify = verify_ssl
        if client_cert:
            session.cert = client_cert

        # 注意:requests.Session沒有內建的默認超時屬性,
        # 超時需要在每次請求時通過timeout參數傳入(見下方調用)

        # 設置重試
        from urllib3.util.retry import Retry
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount('http://', adapter)
        session.mount('https://', adapter)

        return session

    # 創建配置好的Session
    secure_session = create_secure_session(
        # proxy={'http': 'http://proxy.example.com:8080'},
        verify_ssl=True
    )

    try:
        response = secure_session.get('https://httpbin.org/get', timeout=30)
        print(f"✓ 安全Session請求成功: {response.status_code}")
        print(f"  SSL驗證: {'啓用' if secure_session.verify else '禁用'}")
        print(f"  代理設置: {secure_session.proxies if secure_session.proxies else '無'}")
    except Exception as e:
        print(f"✗ 安全Session請求失敗: {e}")

    # 8. 環境變量代理配置
    print("\n8. 環境變量代理配置:")

    import os

    # Requests會自動讀取這些環境變量
    env_vars = {
        'HTTP_PROXY': 'http://proxy.example.com:8080',
        'HTTPS_PROXY': 'https://proxy.example.com:8080',
        'NO_PROXY': 'localhost,127.0.0.1,.local'
    }

    print("可以設置的環境變量:")
    for var, value in env_vars.items():
        print(f"  {var}={value}")

    # 檢查當前環境變量
    current_proxy = os.environ.get('HTTP_PROXY') or os.environ.get('http_proxy')
    if current_proxy:
        print(f"當前HTTP代理: {current_proxy}")
    else:
        print("未設置HTTP代理環境變量")

# 運行代理和SSL演示
if __name__ == "__main__":
    proxy_and_ssl_demo()
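
把前面的代理測試思路組合起來,可以實現一個極簡的代理池:請求失敗時剔除壞代理並換用下一個。以下是示意代碼,其中的代理地址均爲佔位符,實際使用前需替換爲真實可用的代理:

import random
import requests

class SimpleProxyPool:
    """極簡代理池示意:請求失敗即剔除該代理"""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def get(self, url, **kwargs):
        while self.proxies:
            proxy = random.choice(self.proxies)
            try:
                return requests.get(url, proxies=proxy, timeout=10, **kwargs)
            except requests.exceptions.RequestException:
                print(f"代理不可用,剔除: {proxy['http']}")
                self.proxies.remove(proxy)
        raise RuntimeError("代理池已耗盡")

# 使用示例(佔位代理,運行前請替換)
pool = SimpleProxyPool([
    {'http': 'http://proxy1.example.com:8080', 'https': 'http://proxy1.example.com:8080'},
    {'http': 'http://proxy2.example.com:8080', 'https': 'http://proxy2.example.com:8080'},
])
# response = pool.get('https://httpbin.org/ip')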

Cookie處理

Cookie是Web應用中維護狀態的重要機制,Requests提供了強大的Cookie處理功能。

import requests
import time
from datetime import datetime, timedelta

def cookie_handling_demo():
    """
    演示Cookie處理的各種功能
    """
    print("=== Cookie處理演示 ===")

    # 1. 基本Cookie操作
    print("\n1. 基本Cookie操作:")

    # 發送帶Cookie的請求
    cookies = {'session_id': 'abc123', 'user_pref': 'dark_mode'}
    response = requests.get('https://httpbin.org/cookies', cookies=cookies)

    if response.status_code == 200:
        received_cookies = response.json().get('cookies', {})
        print(f"發送的Cookies: {cookies}")
        print(f"服務器接收的Cookies: {received_cookies}")

    # 2. 從響應中獲取Cookie
    print("\n2. 從響應中獲取Cookie:")

    # 請求設置Cookie的URL
    response = requests.get('https://httpbin.org/cookies/set/test_cookie/test_value')

    print(f"響應狀態碼: {response.status_code}")
    print(f"響應中的Cookies: {dict(response.cookies)}")

    # 查看Cookie詳細信息
    for cookie in response.cookies:
        print(f"Cookie詳情:")
        print(f"  名稱: {cookie.name}")
        print(f"  值: {cookie.value}")
        print(f"  域: {cookie.domain}")
        print(f"  路徑: {cookie.path}")
        print(f"  過期時間: {cookie.expires}")
        print(f"  安全標誌: {cookie.secure}")
        print(f"  HttpOnly: {cookie.has_nonstandard_attr('HttpOnly')}")

    # 3. Cookie持久化
    print("\n3. Cookie持久化演示:")

    # 創建Session來自動管理Cookie
    session = requests.Session()

    # 第一次請求,服務器設置Cookie
    response1 = session.get('https://httpbin.org/cookies/set/persistent_cookie/persistent_value')
    print(f"第一次請求狀態: {response1.status_code}")
    print(f"Session中的Cookies: {dict(session.cookies)}")

    # 第二次請求,自動攜帶Cookie
    response2 = session.get('https://httpbin.org/cookies')
    if response2.status_code == 200:
        cookies_data = response2.json()
        print(f"第二次請求攜帶的Cookies: {cookies_data.get('cookies', {})}")

    # 4. 手動Cookie管理
    print("\n4. 手動Cookie管理:")

    from requests.cookies import RequestsCookieJar

    # 創建Cookie容器
    jar = RequestsCookieJar()

    # 添加Cookie
    jar.set('custom_cookie', 'custom_value', domain='httpbin.org', path='/')
    jar.set('another_cookie', 'another_value', domain='httpbin.org', path='/')

    # 使用自定義Cookie容器
    response = requests.get('https://httpbin.org/cookies', cookies=jar)

    if response.status_code == 200:
        print(f"自定義Cookie容器: {dict(jar)}")
        print(f"服務器接收: {response.json().get('cookies', {})}")

    # 5. Cookie的高級屬性
    print("\n5. Cookie高級屬性演示:")

    def create_advanced_cookie():
        """創建帶高級屬性的Cookie"""
        jar = RequestsCookieJar()

        # 設置帶過期時間的Cookie
        expire_time = int(time.time()) + 3600  # 1小時後過期
        jar.set(
            'session_token', 
            'token_12345',
            domain='httpbin.org',
            path='/',
            expires=expire_time,
            secure=True,  # 只在HTTPS下傳輸
            rest={'HttpOnly': True}  # 防止JavaScript訪問
        )

        # 設置SameSite屬性
        jar.set(
            'csrf_token',
            'csrf_abc123',
            domain='httpbin.org',
            path='/',
            rest={'SameSite': 'Strict'}
        )

        return jar

    advanced_jar = create_advanced_cookie()
    print(f"高級Cookie容器: {dict(advanced_jar)}")

    # 6. Cookie文件操作
    print("\n6. Cookie文件操作:")

    import pickle
    import os

    # 保存Cookie到文件
    def save_cookies_to_file(session, filename):
        """保存Session的Cookie到文件"""
        with open(filename, 'wb') as f:
            pickle.dump(session.cookies, f)
        print(f"Cookies已保存到: {filename}")

    # 從文件加載Cookie
    def load_cookies_from_file(session, filename):
        """從文件加載Cookie到Session"""
        if os.path.exists(filename):
            with open(filename, 'rb') as f:
                session.cookies.update(pickle.load(f))
            print(f"Cookies已從文件加載: {filename}")
            return True
        return False

    # 演示Cookie文件操作
    cookie_session = requests.Session()

    # 設置一些Cookie
    cookie_session.get('https://httpbin.org/cookies/set/file_cookie/file_value')

    # 保存到文件
    cookie_file = 'session_cookies.pkl'
    save_cookies_to_file(cookie_session, cookie_file)

    # 創建新Session並加載Cookie
    new_session = requests.Session()
    if load_cookies_from_file(new_session, cookie_file):
        response = new_session.get('https://httpbin.org/cookies')
        if response.status_code == 200:
            print(f"加載的Cookies驗證: {response.json().get('cookies', {})}")

    # 清理文件
    if os.path.exists(cookie_file):
        os.remove(cookie_file)
        print(f"已清理Cookie文件: {cookie_file}")

    # 7. Cookie域和路徑管理
    print("\n7. Cookie域和路徑管理:")

    def demonstrate_cookie_scope():
        """演示Cookie的作用域"""
        jar = RequestsCookieJar()

        # 設置不同域和路徑的Cookie
        jar.set('global_cookie', 'global_value', domain='.example.com', path='/')
        jar.set('api_cookie', 'api_value', domain='api.example.com', path='/v1/')
        jar.set('admin_cookie', 'admin_value', domain='admin.example.com', path='/admin/')

        print("Cookie作用域演示:")
        for cookie in jar:
            print(f"  {cookie.name}: 域={cookie.domain}, 路徑={cookie.path}")

        return jar

    scope_jar = demonstrate_cookie_scope()

    # 8. Cookie安全性
    print("\n8. Cookie安全性演示:")

    def create_secure_cookies():
        """創建安全的Cookie設置"""
        jar = RequestsCookieJar()

        # 安全Cookie設置
        security_settings = {
            'session_id': {
                'value': 'secure_session_123',
                'secure': True,  # 只在HTTPS傳輸
                'httponly': True,  # 防止XSS攻擊
                'samesite': 'Strict',  # 防止CSRF攻擊
                'expires': int(time.time()) + 1800  # 30分鐘過期
            },
            'csrf_token': {
                'value': 'csrf_token_456',
                'secure': True,
                'samesite': 'Strict',
                'expires': int(time.time()) + 3600  # 1小時過期
            }
        }

        for name, settings in security_settings.items():
            jar.set(
                name,
                settings['value'],
                domain='httpbin.org',
                path='/',
                expires=settings.get('expires'),
                secure=settings.get('secure', False),
                rest={
                    'HttpOnly': settings.get('httponly', False),
                    'SameSite': settings.get('samesite', 'Lax')
                }
            )

        print("安全Cookie配置:")
        for cookie in jar:
            print(f"  {cookie.name}: 安全={cookie.secure}")

        return jar

    secure_jar = create_secure_cookies()

    # 9. Cookie調試和分析
    print("\n9. Cookie調試和分析:")

    def analyze_cookies(response):
        """分析響應中的Cookie"""
        print("Cookie分析報告:")

        if not response.cookies:
            print("  無Cookie")
            return

        for cookie in response.cookies:
            print(f"\n  Cookie: {cookie.name}")
            print(f"    值: {cookie.value}")
            print(f"    域: {cookie.domain or '未設置'}")
            print(f"    路徑: {cookie.path or '/'}")

            if cookie.expires:
                expire_date = datetime.fromtimestamp(cookie.expires)
                print(f"    過期時間: {expire_date}")

                # 檢查是否即將過期
                if expire_date < datetime.now() + timedelta(hours=1):
                    print(f"    ⚠️  警告: Cookie將在1小時內過期")
            else:
                print(f"    過期時間: 會話結束")

            print(f"    安全標誌: {cookie.secure}")
            print(f"    大小: {len(cookie.value)}字節")

            # 檢查Cookie大小
            if len(cookie.value) > 4000:
                print(f"    ⚠️  警告: Cookie過大,可能被截斷")

    # 分析一個帶Cookie的響應
    test_response = requests.get('https://httpbin.org/cookies/set/analysis_cookie/test_analysis_value')
    analyze_cookies(test_response)

    # 10. Cookie錯誤處理
    print("\n10. Cookie錯誤處理:")

    def handle_cookie_errors():
        """處理Cookie相關錯誤"""
        try:
            # 嘗試設置無效的Cookie
            jar = RequestsCookieJar()

            # 測試各種邊界情況
            test_cases = [
                ('valid_cookie', 'valid_value'),
                ('', 'empty_name'),  # 空名稱
                ('space cookie', 'space_in_name'),  # 名稱包含空格
                ('valid_name', ''),  # 空值
                ('long_cookie', 'x' * 5000),  # 超長值
            ]

            for name, value in test_cases:
                try:
                    jar.set(name, value, domain='httpbin.org')
                    print(f"✓ 成功設置Cookie: {name[:20]}...")
                except Exception as e:
                    print(f"✗ 設置Cookie失敗 ({name[:20]}...): {e}")

            # 測試Cookie發送
            response = requests.get('https://httpbin.org/cookies', cookies=jar, timeout=5)
            print(f"Cookie發送測試: {response.status_code}")

        except requests.exceptions.RequestException as e:
            print(f"Cookie請求異常: {e}")
        except Exception as e:
            print(f"Cookie處理異常: {e}")

    handle_cookie_errors()

# 運行Cookie演示
if __name__ == "__main__":
    cookie_handling_demo()
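
除了pickle,還可以用標準庫http.cookiejar中的MozillaCookieJar把Cookie保存爲瀏覽器通用的Netscape文本格式,便於人工查看和跨工具共享。下面是一個簡單示例:

import os
import requests
from http.cookiejar import MozillaCookieJar

session = requests.Session()
session.get('https://httpbin.org/cookies/set/jar_cookie/jar_value')

# 保存爲Netscape格式文本文件(會話Cookie需要ignore_discard=True才會寫入)
jar = MozillaCookieJar('cookies.txt')
for cookie in session.cookies:
    jar.set_cookie(cookie)
jar.save(ignore_discard=True, ignore_expires=True)

# 新Session加載已保存的Cookie
new_session = requests.Session()
restored = MozillaCookieJar('cookies.txt')
restored.load(ignore_discard=True, ignore_expires=True)
new_session.cookies.update(restored)
print(f"恢復的Cookies: {dict(new_session.cookies)}")

# 清理示例文件
os.remove('cookies.txt')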

文件上傳和下載

文件傳輸是網絡爬蟲和自動化中的重要功能,Requests提供了簡單而強大的文件處理能力。

import requests
import os
import io
from pathlib import Path
import mimetypes
import hashlib
from tqdm import tqdm

def file_transfer_demo():
    """
    演示文件上傳和下載功能
    """
    print("=== 文件上傳和下載演示 ===")

    # 1. 基本文件上傳
    print("\n1. 基本文件上傳:")

    # 創建測試文件
    test_file_content = "這是一個測試文件\nTest file content\n測試數據123"
    test_file_path = "test_upload.txt"

    with open(test_file_path, 'w', encoding='utf-8') as f:
        f.write(test_file_content)

    # 方法1: 使用files參數上傳
    with open(test_file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post('https://httpbin.org/post', files=files)

    if response.status_code == 200:
        result = response.json()
        print(f"文件上傳成功")
        print(f"上傳的文件信息: {result.get('files', {})}")

    # 2. 高級文件上傳
    print("\n2. 高級文件上傳:")

    # 指定文件名和MIME類型
    with open(test_file_path, 'rb') as f:
        files = {
            'document': ('custom_name.txt', f, 'text/plain'),
            'metadata': ('info.json', io.StringIO('{"type": "document"}'), 'application/json')
        }

        # 同時發送表單數據
        data = {
            'title': '測試文檔',
            'description': '這是一個測試上傳',
            'category': 'test'
        }

        response = requests.post('https://httpbin.org/post', files=files, data=data)

    if response.status_code == 200:
        result = response.json()
        print(f"高級上傳成功")
        print(f"表單數據: {result.get('form', {})}")
        print(f"文件數據: {list(result.get('files', {}).keys())}")

    # 3. 多文件上傳
    print("\n3. 多文件上傳:")

    # 創建多個測試文件
    test_files = []
    for i in range(3):
        filename = f"test_file_{i+1}.txt"
        content = f"這是測試文件 {i+1}\nFile {i+1} content\n"

        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)
        test_files.append(filename)

    # 上傳多個文件
    files = []
    for filename in test_files:
        files.append(('files', (filename, open(filename, 'rb'), 'text/plain')))

    try:
        response = requests.post('https://httpbin.org/post', files=files)

        if response.status_code == 200:
            result = response.json()
            print(f"多文件上傳成功")
            print(f"上傳文件數量: {len(result.get('files', {}))}")
    finally:
        # 關閉文件句柄
        for _, (_, file_obj, _) in files:
            file_obj.close()

    # 4. 內存文件上傳
    print("\n4. 內存文件上傳:")

    # 創建內存中的文件
    memory_file = io.BytesIO()
    memory_file.write("內存中的文件內容\nMemory file content".encode('utf-8'))
    memory_file.seek(0)  # 重置指針到開始

    files = {'memory_file': ('memory.txt', memory_file, 'text/plain')}
    response = requests.post('https://httpbin.org/post', files=files)

    if response.status_code == 200:
        print(f"內存文件上傳成功")

    memory_file.close()

    # 5. 文件下載基礎
    print("\n5. 文件下載基礎:")

    # 下載小文件
    download_url = 'https://httpbin.org/json'
    response = requests.get(download_url)

    if response.status_code == 200:
        # 保存到文件
        download_filename = 'downloaded_data.json'
        with open(download_filename, 'wb') as f:
            f.write(response.content)

        print(f"文件下載成功: {download_filename}")
        print(f"文件大小: {len(response.content)}字節")
        print(f"Content-Type: {response.headers.get('content-type')}")

    # 6. 大文件下載(流式下載)
    print("\n6. 大文件流式下載:")

    def download_large_file(url, filename, chunk_size=8192):
        """流式下載大文件"""
        try:
            with requests.get(url, stream=True) as response:
                response.raise_for_status()

                # 獲取文件大小
                total_size = int(response.headers.get('content-length', 0))

                with open(filename, 'wb') as f:
                    if total_size > 0:
                        # 使用進度條
                        with tqdm(total=total_size, unit='B', unit_scale=True, desc=filename) as pbar:
                            for chunk in response.iter_content(chunk_size=chunk_size):
                                if chunk:
                                    f.write(chunk)
                                    pbar.update(len(chunk))
                    else:
                        # 無法獲取文件大小時
                        downloaded = 0
                        for chunk in response.iter_content(chunk_size=chunk_size):
                            if chunk:
                                f.write(chunk)
                                downloaded += len(chunk)
                                print(f"\r已下載: {downloaded}字節", end='', flush=True)
                        print()  # 換行

                print(f"\n✓ 文件下載完成: {filename}")
                return True

        except requests.exceptions.RequestException as e:
            print(f"✗ 下載失敗: {e}")
            return False

    # 演示流式下載(使用較小的文件作爲示例)
    large_file_url = 'https://httpbin.org/bytes/10240'  # 10KB測試文件
    if download_large_file(large_file_url, 'large_download.bin'):
        file_size = os.path.getsize('large_download.bin')
        print(f"下載文件大小: {file_size}字節")

    # 7. 斷點續傳下載
    print("\n7. 斷點續傳下載:")

    def resume_download(url, filename, chunk_size=8192):
        """支持斷點續傳的下載"""
        # 檢查本地文件是否存在
        resume_pos = 0
        if os.path.exists(filename):
            resume_pos = os.path.getsize(filename)
            print(f"發現本地文件,從位置 {resume_pos} 繼續下載")

        # 設置Range頭進行斷點續傳
        headers = {'Range': f'bytes={resume_pos}-'} if resume_pos > 0 else {}

        try:
            response = requests.get(url, headers=headers, stream=True)

            # 檢查服務器是否支持斷點續傳
            if resume_pos > 0 and response.status_code != 206:
                print("服務器不支持斷點續傳,重新下載")
                resume_pos = 0
                response = requests.get(url, stream=True)

            response.raise_for_status()

            # 獲取總文件大小
            if 'content-range' in response.headers:
                total_size = int(response.headers['content-range'].split('/')[-1])
            else:
                total_size = int(response.headers.get('content-length', 0)) + resume_pos

            # 打開文件(追加模式如果是續傳)
            mode = 'ab' if resume_pos > 0 else 'wb'
            with open(filename, mode) as f:
                downloaded = resume_pos

                for chunk in response.iter_content(chunk_size=chunk_size):
                    if chunk:
                        f.write(chunk)
                        downloaded += len(chunk)

                        if total_size > 0:
                            progress = (downloaded / total_size) * 100
                            print(f"\r下載進度: {progress:.1f}% ({downloaded}/{total_size})", end='', flush=True)

                print(f"\n✓ 下載完成: {filename}")
                return True

        except requests.exceptions.RequestException as e:
            print(f"✗ 下載失敗: {e}")
            return False

    # 演示斷點續傳(模擬)
    resume_url = 'https://httpbin.org/bytes/5120'  # 5KB測試文件
    resume_filename = 'resume_download.bin'

    # 先下載一部分(模擬中斷)
    try:
        response = requests.get(resume_url, stream=True)
        with open(resume_filename, 'wb') as f:
            for i, chunk in enumerate(response.iter_content(chunk_size=1024)):
                if i >= 2:  # 只下載前2KB
                    break
                f.write(chunk)
        print(f"模擬下載中斷,已下載: {os.path.getsize(resume_filename)}字節")
    except requests.exceptions.RequestException:
        pass  # 模擬中斷,忽略過程中的網絡異常

    # 繼續下載
    resume_download(resume_url, resume_filename)

    # 8. 文件完整性驗證
    print("\n8. 文件完整性驗證:")

    def verify_file_integrity(filename, expected_hash=None, hash_algorithm='md5'):
        """驗證文件完整性"""
        if not os.path.exists(filename):
            print(f"✗ 文件不存在: {filename}")
            return False

        # 計算文件哈希
        hash_obj = hashlib.new(hash_algorithm)
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                hash_obj.update(chunk)

        file_hash = hash_obj.hexdigest()
        print(f"文件 {filename}{hash_algorithm.upper()}哈希: {file_hash}")

        if expected_hash:
            if file_hash == expected_hash:
                print(f"✓ 文件完整性驗證通過")
                return True
            else:
                print(f"✗ 文件完整性驗證失敗")
                print(f"  期望: {expected_hash}")
                print(f"  實際: {file_hash}")
                return False

        return True

    # 驗證下載的文件
    for filename in ['downloaded_data.json', 'large_download.bin']:
        if os.path.exists(filename):
            verify_file_integrity(filename)

    # 9. 自動MIME類型檢測
    print("\n9. 自動MIME類型檢測:")

    def upload_with_auto_mime(filename):
        """自動檢測MIME類型並上傳"""
        if not os.path.exists(filename):
            print(f"文件不存在: {filename}")
            return

        # 自動檢測MIME類型
        mime_type, _ = mimetypes.guess_type(filename)
        if mime_type is None:
            mime_type = 'application/octet-stream'  # 默認二進制類型

        print(f"文件: {filename}")
        print(f"檢測到的MIME類型: {mime_type}")

        with open(filename, 'rb') as f:
            files = {'file': (filename, f, mime_type)}
            response = requests.post('https://httpbin.org/post', files=files)

            if response.status_code == 200:
                print(f"✓ 上傳成功")
            else:
                print(f"✗ 上傳失敗: {response.status_code}")

    # 測試不同類型的文件
    test_files_mime = ['test_upload.txt', 'downloaded_data.json']
    for filename in test_files_mime:
        if os.path.exists(filename):
            upload_with_auto_mime(filename)

    # 10. 清理測試文件
    print("\n10. 清理測試文件:")

    cleanup_files = [
        test_file_path, 'downloaded_data.json', 'large_download.bin',
        'resume_download.bin'
    ] + test_files

    for filename in cleanup_files:
        if os.path.exists(filename):
            try:
                os.remove(filename)
                print(f"✓ 已刪除: {filename}")
            except Exception as e:
                print(f"✗ 刪除失敗 {filename}: {e}")

# 運行文件傳輸演示
if __name__ == "__main__":
    file_transfer_demo()
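
當需要批量下載多個文件時,可以結合標準庫concurrent.futures用多線程並發下載,每個線程複用上面介紹的流式下載方式。下面是一個簡化示意(以httpbin的測試文件爲例):

import os
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url, filename):
    """流式下載單個文件,返回文件名和字節數"""
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    return filename, os.path.getsize(filename)

tasks = [
    ('https://httpbin.org/bytes/1024', 'part1.bin'),
    ('https://httpbin.org/bytes/2048', 'part2.bin'),
    ('https://httpbin.org/bytes/4096', 'part3.bin'),
]

# 最多3個線程並發下載
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(download, url, name): name for url, name in tasks}
    for future in as_completed(futures):
        try:
            name, size = future.result()
            print(f"✓ {name}: {size}字節")
        except Exception as e:
            print(f"✗ {futures[future]} 下載失敗: {e}")

# 清理示例文件
for _, name in tasks:
    if os.path.exists(name):
        os.remove(name)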

超時和重試機制

在網絡請求中,超時和重試機制是確保程序穩定性的重要功能。

import requests
import time
import random
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from functools import wraps
import logging

def timeout_and_retry_demo():
    """
    演示超時和重試機制
    """
    print("=== 超時和重試機制演示 ===")

    # 1. 基本超時設置
    print("\n1. 基本超時設置:")

    # 連接超時和讀取超時
    try:
        # timeout=(連接超時, 讀取超時)
        response = requests.get('https://httpbin.org/delay/2', timeout=(5, 10))
        print(f"請求成功: {response.status_code}")
        print(f"響應時間: {response.elapsed.total_seconds():.2f}秒")
    except requests.exceptions.Timeout as e:
        print(f"請求超時: {e}")
    except requests.exceptions.RequestException as e:
        print(f"請求異常: {e}")

    # 2. 不同類型的超時
    print("\n2. 不同類型的超時演示:")

    def test_different_timeouts():
        """測試不同的超時設置"""
        timeout_configs = [
            ("單一超時", 5),  # 連接和讀取都是5秒
            ("分別設置", (3, 10)),  # 連接3秒,讀取10秒
            ("只設置連接超時", (2, None)),  # 只設置連接超時
        ]

        for desc, timeout in timeout_configs:
            try:
                print(f"\n測試 {desc}: {timeout}")
                start_time = time.time()
                response = requests.get('https://httpbin.org/delay/1', timeout=timeout)
                elapsed = time.time() - start_time
                print(f"  ✓ 成功: {response.status_code}, 耗時: {elapsed:.2f}秒")
            except requests.exceptions.Timeout as e:
                elapsed = time.time() - start_time
                print(f"  ✗ 超時: {elapsed:.2f}秒, {e}")
            except Exception as e:
                print(f"  ✗ 異常: {e}")

    test_different_timeouts()

    # 3. 手動重試機制
    print("\n3. 手動重試機制:")

    def manual_retry(url, max_retries=3, delay=1, backoff=2):
        """手動實現重試機制"""
        for attempt in range(max_retries + 1):
            try:
                print(f"  嘗試 {attempt + 1}/{max_retries + 1}: {url}")
                response = requests.get(url, timeout=5)

                # 檢查響應狀態
                if response.status_code == 200:
                    print(f"  ✓ 成功: {response.status_code}")
                    return response
                elif response.status_code >= 500:
                    # 服務器錯誤,可以重試
                    print(f"  服務器錯誤 {response.status_code},準備重試")
                    raise requests.exceptions.RequestException(f"Server error: {response.status_code}")
                else:
                    # 客戶端錯誤,不重試
                    print(f"  客戶端錯誤 {response.status_code},不重試")
                    return response

            except (requests.exceptions.Timeout, 
                   requests.exceptions.ConnectionError,
                   requests.exceptions.RequestException) as e:
                print(f"  ✗ 請求失敗: {e}")

                if attempt < max_retries:
                    wait_time = delay * (backoff ** attempt)
                    print(f"  等待 {wait_time:.1f}秒 後重試...")
                    time.sleep(wait_time)
                else:
                    print(f"  已達到最大重試次數,放棄")
                    raise

        return None

    # 測試手動重試
    try:
        response = manual_retry('https://httpbin.org/status/500', max_retries=2)
    except Exception as e:
        print(f"手動重試最終失敗: {e}")

    # 4. 使用urllib3的重試策略
    print("\n4. urllib3重試策略:")

    def create_retry_session():
        """創建帶重試策略的Session"""
        session = requests.Session()

        # 定義重試策略
        retry_strategy = Retry(
            total=3,  # 總重試次數
            status_forcelist=[429, 500, 502, 503, 504],  # 需要重試的狀態碼
            allowed_methods=["HEAD", "GET", "OPTIONS"],  # 允許重試的方法(舊版urllib3中的參數名爲method_whitelist)
            backoff_factor=1,  # 退避因子
            raise_on_redirect=False,
            raise_on_status=False
        )

        # 創建適配器
        adapter = HTTPAdapter(max_retries=retry_strategy)

        # 掛載適配器
        session.mount("http://", adapter)
        session.mount("https://", adapter)

        return session

    # 使用重試Session
    retry_session = create_retry_session()

    try:
        print("使用重試Session請求:")
        response = retry_session.get('https://httpbin.org/status/503', timeout=10)
        print(f"最終響應: {response.status_code}")
    except Exception as e:
        print(f"重試Session失敗: {e}")

    # 5. 高級重試配置
    print("\n5. 高級重試配置:")

    def create_advanced_retry_session():
        """創建高級重試配置的Session"""
        session = requests.Session()

        # 高級重試策略
        retry_strategy = Retry(
            total=5,  # 總重試次數
            read=3,   # 讀取重試次數
            connect=3,  # 連接重試次數
            status=3,   # 狀態碼重試次數
            status_forcelist=[408, 429, 500, 502, 503, 504, 520, 522, 524],
            allowed_methods=["HEAD", "GET", "PUT", "DELETE", "OPTIONS", "TRACE"],
            backoff_factor=0.3,  # 退避因子:{backoff factor} * (2 ** ({number of total retries} - 1))
            raise_on_redirect=False,
            raise_on_status=False,
            respect_retry_after_header=True  # 尊重服務器的Retry-After頭
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)

        return session

    advanced_session = create_advanced_retry_session()

    # 測試高級重試
    test_urls = [
        ('正常請求', 'https://httpbin.org/get'),
        ('服務器錯誤', 'https://httpbin.org/status/500'),
        ('超時請求', 'https://httpbin.org/delay/3')
    ]

    for desc, url in test_urls:
        try:
            print(f"\n測試 {desc}:")
            start_time = time.time()
            response = advanced_session.get(url, timeout=(5, 10))
            elapsed = time.time() - start_time
            print(f"  ✓ 響應: {response.status_code}, 耗時: {elapsed:.2f}秒")
        except Exception as e:
            elapsed = time.time() - start_time
            print(f"  ✗ 失敗: {e}, 耗時: {elapsed:.2f}秒")

    # 6. 裝飾器重試
    print("\n6. 裝飾器重試:")

    def retry_decorator(max_retries=3, delay=1, backoff=2, exceptions=(Exception,)):
        """重試裝飾器"""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except exceptions as e:
                        if attempt == max_retries:
                            print(f"裝飾器重試失敗,已達最大次數: {e}")
                            raise

                        wait_time = delay * (backoff ** attempt)
                        print(f"裝飾器重試 {attempt + 1}/{max_retries + 1} 失敗: {e}")
                        print(f"等待 {wait_time:.1f}秒 後重試...")
                        time.sleep(wait_time)

            return wrapper
        return decorator

    @retry_decorator(max_retries=2, delay=0.5, exceptions=(requests.exceptions.RequestException,))
    def unreliable_request(url):
        """不穩定的請求函數"""
        # 模擬隨機失敗
        if random.random() < 0.7:  # 70%概率失敗
            raise requests.exceptions.ConnectionError("模擬連接失敗")

        response = requests.get(url, timeout=5)
        return response

    # 測試裝飾器重試
    try:
        print("測試裝飾器重試:")
        response = unreliable_request('https://httpbin.org/get')
        print(f"裝飾器重試成功: {response.status_code}")
    except Exception as e:
        print(f"裝飾器重試最終失敗: {e}")

    # 7. 智能重試策略
    print("\n7. 智能重試策略:")

    class SmartRetry:
        """智能重試類"""

        def __init__(self, max_retries=3, base_delay=1, max_delay=60):
            self.max_retries = max_retries
            self.base_delay = base_delay
            self.max_delay = max_delay
            self.attempt_count = 0

        def should_retry(self, exception, response=None):
            """判斷是否應該重試"""
            # 網絡相關異常應該重試
            if isinstance(exception, (requests.exceptions.Timeout,
                                    requests.exceptions.ConnectionError)):
                return True

            # 特定狀態碼應該重試
            if response and response.status_code in [429, 500, 502, 503, 504]:
                return True

            return False

        def get_delay(self):
            """計算延遲時間"""
            # 指數退避 + 隨機抖動
            delay = min(self.base_delay * (2 ** self.attempt_count), self.max_delay)
            jitter = random.uniform(0, 0.1) * delay  # 10%的隨機抖動
            return delay + jitter

        def execute(self, func, *args, **kwargs):
            """執行帶重試的函數"""
            last_exception = None

            for attempt in range(self.max_retries + 1):
                self.attempt_count = attempt

                try:
                    result = func(*args, **kwargs)

                    # 如果是Response對象,檢查狀態碼
                    if hasattr(result, 'status_code'):
                        if self.should_retry(None, result) and attempt < self.max_retries:
                            print(f"智能重試: 狀態碼 {result.status_code},嘗試 {attempt + 1}")
                            time.sleep(self.get_delay())
                            continue

                    print(f"智能重試成功,嘗試次數: {attempt + 1}")
                    return result

                except Exception as e:
                    last_exception = e

                    if self.should_retry(e) and attempt < self.max_retries:
                        delay = self.get_delay()
                        print(f"智能重試: {e},等待 {delay:.2f}秒,嘗試 {attempt + 1}")
                        time.sleep(delay)
                    else:
                        break

            print(f"智能重試失敗,已達最大次數")
            raise last_exception

    # 測試智能重試
    smart_retry = SmartRetry(max_retries=3, base_delay=0.5)

    def test_request():
        # 模擬不穩定的請求
        if random.random() < 0.6:
            raise requests.exceptions.ConnectionError("模擬網絡錯誤")
        return requests.get('https://httpbin.org/get', timeout=5)

    try:
        response = smart_retry.execute(test_request)
        print(f"智能重試最終成功: {response.status_code}")
    except Exception as e:
        print(f"智能重試最終失敗: {e}")

    # 8. 重試監控和日誌
    print("\n8. 重試監控和日誌:")

    # 配置日誌
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)

    class MonitoredRetry:
        """帶監控的重試類"""

        def __init__(self, max_retries=3):
            self.max_retries = max_retries
            self.stats = {
                'total_attempts': 0,
                'successful_attempts': 0,
                'failed_attempts': 0,
                'retry_reasons': {}
            }

        def request_with_monitoring(self, url, **kwargs):
            """帶監控的請求"""
            for attempt in range(self.max_retries + 1):
                self.stats['total_attempts'] += 1

                try:
                    logger.info(f"嘗試請求 {url},第 {attempt + 1} 次")
                    response = requests.get(url, **kwargs)

                    if response.status_code == 200:
                        self.stats['successful_attempts'] += 1
                        logger.info(f"請求成功: {response.status_code}")
                        return response
                    else:
                        reason = f"status_{response.status_code}"
                        self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1

                        if attempt < self.max_retries:
                            logger.warning(f"請求失敗: {response.status_code},準備重試")
                            time.sleep(1)
                        else:
                            logger.error(f"請求最終失敗: {response.status_code}")
                            return response

                except Exception as e:
                    reason = type(e).__name__
                    self.stats['retry_reasons'][reason] = self.stats['retry_reasons'].get(reason, 0) + 1

                    if attempt < self.max_retries:
                        logger.warning(f"請求異常: {e},準備重試")
                        time.sleep(1)
                    else:
                        self.stats['failed_attempts'] += 1
                        logger.error(f"請求最終異常: {e}")
                        raise

        def get_stats(self):
            """獲取統計信息"""
            return self.stats

    # 測試監控重試
    monitored_retry = MonitoredRetry(max_retries=2)

    test_urls_monitor = [
        'https://httpbin.org/get',
        'https://httpbin.org/status/500',
        'https://httpbin.org/delay/1'
    ]

    for url in test_urls_monitor:
        try:
            response = monitored_retry.request_with_monitoring(url, timeout=3)
            print(f"監控請求結果: {response.status_code if response else 'None'}")
        except Exception as e:
            print(f"監控請求異常: {e}")

    # 顯示統計信息
    stats = monitored_retry.get_stats()
    print(f"\n重試統計信息:")
    print(f"  總嘗試次數: {stats['total_attempts']}")
    print(f"  成功次數: {stats['successful_attempts']}")
    print(f"  失敗次數: {stats['failed_attempts']}")
    print(f"  重試原因: {stats['retry_reasons']}")

    # 9. 超時和重試的最佳實踐
    print("\n9. 超時和重試最佳實踐:")

    def best_practice_request(url, max_retries=3, timeout=(5, 30)):
        """最佳實踐的請求函數"""
        session = requests.Session()

        # 配置重試策略
        retry_strategy = Retry(
            total=max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"],  # urllib3 1.26起的參數名,舊版本叫method_whitelist
            backoff_factor=1,
            respect_retry_after_header=True
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)

        try:
            # 注意:requests的Session並沒有全局timeout屬性,給session.timeout賦值不會生效,
            # 超時必須在每次請求時顯式傳入
            response = session.get(url, timeout=timeout)
            response.raise_for_status()  # 拋出HTTP錯誤
            return response
        except requests.exceptions.Timeout:
            print(f"請求超時: {url}")
            raise
        except requests.exceptions.ConnectionError:
            print(f"連接錯誤: {url}")
            raise
        except requests.exceptions.HTTPError as e:
            print(f"HTTP錯誤: {e}")
            raise
        except requests.exceptions.RequestException as e:
            print(f"請求異常: {e}")
            raise
        finally:
            session.close()

    # 測試最佳實踐
    try:
        response = best_practice_request('https://httpbin.org/get')
        print(f"最佳實踐請求成功: {response.status_code}")
    except Exception as e:
        print(f"最佳實踐請求失敗: {e}")

# 運行超時和重試演示
if __name__ == "__main__":
    timeout_and_retry_demo()
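
補充一點:上面best_practice_request中timeout=(5, 30)的元組寫法,是requests對超時的細分:前一個值限制建立連接的時間,後一個值限制讀取響應的時間。下面用一個最小示例演示兩個階段分別觸發的異常(URL沿用前文的httpbin):

import requests

try:
    # 連接超時5秒:超過5秒仍未完成TCP連接即放棄
    # 讀取超時30秒:連接建立後,超過30秒未收到響應數據即放棄
    response = requests.get('https://httpbin.org/delay/2', timeout=(5, 30))
    print(f"狀態碼: {response.status_code}")
except requests.exceptions.ConnectTimeout:
    print("連接階段超時")
except requests.exceptions.ReadTimeout:
    print("讀取階段超時")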

異常處理

完善的異常處理是構建穩定爬蟲程序的關鍵。
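
在進入完整演示之前,先看一個最小的骨架:requests的異常構成一個繼承體系,except子句必須把具體子類放在基類之前,否則具體分支永遠不會被命中:

import requests
from requests.exceptions import Timeout, ConnectionError, RequestException

def fetch(url):
    try:
        return requests.get(url, timeout=5)
    except Timeout:
        print("請求超時")  # 具體子類放在前面
    except ConnectionError:
        print("連接失敗")
    except RequestException as e:
        print(f"其他請求異常: {e}")  # 基類兜底,放在最後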

import requests
import json
from requests.exceptions import (
    RequestException, Timeout, ConnectionError, HTTPError,
    URLRequired, TooManyRedirects, MissingSchema, InvalidSchema,
    InvalidURL, InvalidHeader, ChunkedEncodingError, ContentDecodingError,
    StreamConsumedError, RetryError, UnrewindableBodyError
)
import logging
import time  # 後文指數退避重試中的time.sleep需要用到
from datetime import datetime

def exception_handling_demo():
    """
    演示Requests異常處理
    """
    print("=== Requests異常處理演示 ===")

    # 配置日誌
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    logger = logging.getLogger(__name__)

    # 1. 基本異常類型
    print("\n1. 基本異常類型演示:")

    def demonstrate_basic_exceptions():
        """演示基本異常類型"""

        # 異常測試用例
        test_cases = [
            {
                'name': '正常請求',
                'url': 'https://httpbin.org/get',
                'expected': 'success'
            },
            {
                'name': '連接超時',
                'url': 'https://httpbin.org/delay/10',
                'timeout': 2,
                'expected': 'timeout'
            },
            {
                'name': '無效URL',
                'url': 'invalid-url',
                'expected': 'invalid_url'
            },
            {
                'name': '不存在的域名',
                'url': 'https://this-domain-does-not-exist-12345.com',
                'expected': 'connection_error'
            },
            {
                'name': 'HTTP錯誤狀態',
                'url': 'https://httpbin.org/status/404',
                'expected': 'http_error'
            },
            {
                'name': '服務器錯誤',
                'url': 'https://httpbin.org/status/500',
                'expected': 'server_error'
            }
        ]

        for case in test_cases:
            print(f"\n測試: {case['name']}")

            try:
                kwargs = {}
                if 'timeout' in case:
                    kwargs['timeout'] = case['timeout']

                response = requests.get(case['url'], **kwargs)

                # 檢查HTTP狀態碼
                if response.status_code >= 400:
                    response.raise_for_status()

                print(f"  ✓ 成功: {response.status_code}")

            except Timeout as e:
                print(f"  ✗ 超時異常: {e}")
                logger.warning(f"請求超時: {case['url']}")

            except ConnectionError as e:
                print(f"  ✗ 連接異常: {e}")
                logger.error(f"連接失敗: {case['url']}")

            except HTTPError as e:
                print(f"  ✗ HTTP異常: {e}")
                print(f"    狀態碼: {e.response.status_code}")
                print(f"    原因: {e.response.reason}")
                logger.error(f"HTTP錯誤: {case['url']} - {e.response.status_code}")

            except InvalidURL as e:
                print(f"  ✗ 無效URL: {e}")
                logger.error(f"URL格式錯誤: {case['url']}")

            except MissingSchema as e:
                print(f"  ✗ 缺少協議: {e}")
                logger.error(f"URL缺少協議: {case['url']}")

            except RequestException as e:
                print(f"  ✗ 請求異常: {e}")
                logger.error(f"通用請求異常: {case['url']} - {e}")

            except Exception as e:
                print(f"  ✗ 未知異常: {e}")
                logger.critical(f"未知異常: {case['url']} - {e}")

    demonstrate_basic_exceptions()

    # 2. 異常層次結構
    print("\n2. 異常層次結構:")

    def show_exception_hierarchy():
        """顯示異常層次結構"""
        exceptions_hierarchy = {
            'RequestException': {
                'description': '所有Requests異常的基類',
                'children': {
                    'HTTPError': '4xx和5xx HTTP狀態碼異常',
                    'ConnectionError': '連接相關異常',
                    'Timeout': '超時異常',
                    'URLRequired': '缺少URL異常',
                    'TooManyRedirects': '重定向次數過多異常',
                    'MissingSchema': '缺少URL協議異常',
                    'InvalidSchema': '無效URL協議異常',
                    'InvalidURL': '無效URL異常',
                    'InvalidHeader': '無效請求頭異常',
                    'ChunkedEncodingError': '分塊編碼錯誤',
                    'ContentDecodingError': '內容解碼錯誤',
                    'StreamConsumedError': '流已消費錯誤',
                    'RetryError': '重試錯誤',
                    'UnrewindableBodyError': '不可重繞請求體錯誤'
                }
            }
        }

        print("Requests異常層次結構:")
        for parent, info in exceptions_hierarchy.items():
            print(f"\n{parent}: {info['description']}")
            for child, desc in info['children'].items():
                print(f"  ├── {child}: {desc}")

    show_exception_hierarchy()

    # 3. 詳細異常處理
    print("\n3. 詳細異常處理:")

    def detailed_exception_handling(url, **kwargs):
        """詳細的異常處理函數"""
        try:
            print(f"請求: {url}")
            response = requests.get(url, **kwargs)
            response.raise_for_status()

            print(f"  ✓ 成功: {response.status_code}")
            return response

        except Timeout as e:
            error_info = {
                'type': 'Timeout',
                'message': str(e),
                'url': url,
                'timestamp': datetime.now().isoformat(),
                'suggestion': '增加超時時間或檢查網絡連接'
            }
            print(f"  ✗ 超時: {error_info}")
            return None

        except ConnectionError as e:
            error_info = {
                'type': 'ConnectionError',
                'message': str(e),
                'url': url,
                'timestamp': datetime.now().isoformat(),
                'suggestion': '檢查網絡連接、DNS設置或目標服務器狀態'
            }
            print(f"  ✗ 連接錯誤: {error_info}")
            return None

        except HTTPError as e:
            status_code = e.response.status_code
            error_info = {
                'type': 'HTTPError',
                'status_code': status_code,
                'reason': e.response.reason,
                'url': url,
                'timestamp': datetime.now().isoformat(),
                'response_headers': dict(e.response.headers),
                'suggestion': get_http_error_suggestion(status_code)
            }
            print(f"  ✗ HTTP錯誤: {error_info}")
            return e.response

        except InvalidURL as e:
            error_info = {
                'type': 'InvalidURL',
                'message': str(e),
                'url': url,
                'timestamp': datetime.now().isoformat(),
                'suggestion': '檢查URL格式是否正確'
            }
            print(f"  ✗ 無效URL: {error_info}")
            return None

        except RequestException as e:
            error_info = {
                'type': 'RequestException',
                'message': str(e),
                'url': url,
                'timestamp': datetime.now().isoformat(),
                'suggestion': '檢查請求參數和網絡環境'
            }
            print(f"  ✗ 請求異常: {error_info}")
            return None

    def get_http_error_suggestion(status_code):
        """根據HTTP狀態碼提供建議"""
        suggestions = {
            400: '檢查請求參數格式',
            401: '檢查身份驗證信息',
            403: '檢查訪問權限',
            404: '檢查URL路徑是否正確',
            405: '檢查HTTP方法是否正確',
            429: '降低請求頻率,實現重試機制',
            500: '服務器內部錯誤,稍後重試',
            502: '網關錯誤,檢查代理設置',
            503: '服務不可用,稍後重試',
            504: '網關超時,增加超時時間'
        }
        return suggestions.get(status_code, '查看服務器文檔或聯繫管理員')

    # 測試詳細異常處理
    test_urls = [
        'https://httpbin.org/get',
        'https://httpbin.org/status/401',
        'https://httpbin.org/delay/5',
        'invalid-url-format'
    ]

    for url in test_urls:
        detailed_exception_handling(url, timeout=3)

    # 4. 異常重試策略
    print("\n4. 異常重試策略:")

    def exception_based_retry(url, max_retries=3, **kwargs):
        """基於異常類型的重試策略"""

        # 定義可重試的異常
        retryable_exceptions = (
            Timeout,
            ConnectionError,
            ChunkedEncodingError,
            ContentDecodingError
        )

        # 定義可重試的HTTP狀態碼
        retryable_status_codes = [429, 500, 502, 503, 504]

        last_exception = None

        for attempt in range(max_retries + 1):
            try:
                print(f"嘗試 {attempt + 1}/{max_retries + 1}: {url}")
                response = requests.get(url, **kwargs)

                # 檢查狀態碼是否需要重試
                if response.status_code in retryable_status_codes and attempt < max_retries:
                    print(f"  狀態碼 {response.status_code} 需要重試")
                    time.sleep(2 ** attempt)  # 指數退避
                    continue

                response.raise_for_status()
                print(f"  ✓ 成功: {response.status_code}")
                return response

            except retryable_exceptions as e:
                last_exception = e
                if attempt < max_retries:
                    wait_time = 2 ** attempt
                    print(f"  可重試異常 {type(e).__name__}: {e}")
                    print(f"  等待 {wait_time}秒 後重試...")
                    time.sleep(wait_time)
                else:
                    print(f"  重試次數已用完")
                    break

            except HTTPError as e:
                if e.response.status_code in retryable_status_codes and attempt < max_retries:
                    wait_time = 2 ** attempt
                    print(f"  HTTP錯誤 {e.response.status_code} 可重試")
                    print(f"  等待 {wait_time}秒 後重試...")
                    time.sleep(wait_time)
                else:
                    print(f"  HTTP錯誤 {e.response.status_code} 不可重試")
                    raise

            except RequestException as e:
                print(f"  不可重試異常: {e}")
                raise

        # 如果所有重試都失敗了
        if last_exception:
            raise last_exception

    # 測試異常重試
    retry_test_urls = [
        'https://httpbin.org/status/503',
        'https://httpbin.org/delay/2'
    ]

    for url in retry_test_urls:
        try:
            response = exception_based_retry(url, max_retries=2, timeout=3)
            print(f"重試成功: {response.status_code}")
        except Exception as e:
            print(f"重試失敗: {e}")

    # 5. 異常日誌記錄
    print("\n5. 異常日誌記錄:")

    class RequestLogger:
        """請求日誌記錄器"""

        def __init__(self, logger_name='requests_logger'):
            self.logger = logging.getLogger(logger_name)
            self.logger.setLevel(logging.INFO)

            # logging.getLogger按名稱返回同一個logger對象,重複實例化會重複添加處理器,
            # 導致每條日誌輸出多次,因此先檢查是否已經配置過
            if not self.logger.handlers:
                # 創建文件處理器
                file_handler = logging.FileHandler('requests_errors.log')
                file_handler.setLevel(logging.ERROR)

                # 創建控制檯處理器
                console_handler = logging.StreamHandler()
                console_handler.setLevel(logging.INFO)

                # 創建格式器
                formatter = logging.Formatter(
                    '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
                )
                file_handler.setFormatter(formatter)
                console_handler.setFormatter(formatter)

                # 添加處理器
                self.logger.addHandler(file_handler)
                self.logger.addHandler(console_handler)

        def log_request(self, method, url, **kwargs):
            """記錄請求信息"""
            self.logger.info(f"發起請求: {method.upper()} {url}")
            if kwargs:
                self.logger.debug(f"請求參數: {kwargs}")

        def log_response(self, response):
            """記錄響應信息"""
            self.logger.info(
                f"收到響應: {response.status_code} {response.reason} "
                f"({len(response.content)}字節)"
            )

        def log_exception(self, exception, url, context=None):
            """記錄異常信息"""
            error_data = {
                'exception_type': type(exception).__name__,
                'exception_message': str(exception),
                'url': url,
                'timestamp': datetime.now().isoformat()
            }

            if context:
                error_data.update(context)

            self.logger.error(f"請求異常: {json.dumps(error_data, ensure_ascii=False)}")

        def safe_request(self, method, url, **kwargs):
            """安全的請求方法"""
            self.log_request(method, url, **kwargs)

            try:
                response = requests.request(method, url, **kwargs)
                self.log_response(response)
                response.raise_for_status()
                return response

            except Exception as e:
                context = {
                    'method': method,
                    'kwargs': {k: str(v) for k, v in kwargs.items()}
                }
                self.log_exception(e, url, context)
                raise

    # 測試日誌記錄
    request_logger = RequestLogger()

    test_requests = [
        ('GET', 'https://httpbin.org/get'),
        ('GET', 'https://httpbin.org/status/404'),
        ('POST', 'https://httpbin.org/post', {'json': {'test': 'data'}})
    ]

    for method, url, *args in test_requests:
        kwargs = args[0] if args else {}
        try:
            response = request_logger.safe_request(method, url, **kwargs)
            print(f"日誌請求成功: {response.status_code}")
        except Exception as e:
            print(f"日誌請求失敗: {e}")

    # 6. 自定義異常類
    print("\n6. 自定義異常類:")

    class CustomRequestException(RequestException):
        """自定義請求異常"""
        pass

    class RateLimitException(CustomRequestException):
        """頻率限制異常"""
        def __init__(self, message, retry_after=None):
            super().__init__(message)
            self.retry_after = retry_after

    class DataValidationException(CustomRequestException):
        """數據驗證異常"""
        def __init__(self, message, validation_errors=None):
            super().__init__(message)
            self.validation_errors = validation_errors or []

    def custom_request_handler(url, **kwargs):
        """使用自定義異常的請求處理器"""
        try:
            response = requests.get(url, **kwargs)

            # 檢查特定狀態碼並拋出自定義異常
            if response.status_code == 429:
                retry_after = response.headers.get('Retry-After')
                raise RateLimitException(
                    "請求頻率過高",
                    retry_after=retry_after
                )

            if response.status_code == 422:
                try:
                    error_data = response.json()
                    validation_errors = error_data.get('errors', [])
                    raise DataValidationException(
                        "數據驗證失敗",
                        validation_errors=validation_errors
                    )
                except ValueError:
                    raise DataValidationException("數據驗證失敗")

            response.raise_for_status()
            return response

        except RateLimitException as e:
            print(f"頻率限制: {e}")
            if e.retry_after:
                print(f"建議等待: {e.retry_after}秒")
            raise

        except DataValidationException as e:
            print(f"數據驗證錯誤: {e}")
            if e.validation_errors:
                print(f"驗證錯誤詳情: {e.validation_errors}")
            raise

    # 測試自定義異常
    try:
        response = custom_request_handler('https://httpbin.org/status/429')
    except RateLimitException as e:
        print(f"捕獲自定義異常: {e}")
    except Exception as e:
        print(f"其他異常: {e}")

# 運行異常處理演示
if __name__ == "__main__":
    exception_handling_demo()
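
上面RateLimitException攜帶的retry_after取自HTTP的Retry-After響應頭。作爲補充,下面是處理該頭部的一個簡單思路(示意寫法,只覆蓋秒數形式,HTTP日期形式需要另行解析):

import time

def wait_if_rate_limited(response):
    """遇到429時按Retry-After頭等待,返回是否應當重試(示意寫法)"""
    if response.status_code != 429:
        return False
    retry_after = response.headers.get('Retry-After', '1')
    try:
        seconds = int(retry_after)  # Retry-After常見形式是秒數
    except ValueError:
        seconds = 5  # 無法解析時的保守默認值(假設值)
    time.sleep(seconds)
    return True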

通過以上詳細的代碼示例和說明,我們完成了14.2節Requests庫網絡請求的全部內容。這一節涵蓋了從基礎使用到高級功能的各個方面,包括GET/POST請求、參數處理、響應對象、Session管理、身份驗證、代理設置、SSL配置、Cookie處理、文件上傳下載、超時重試機制和異常處理等核心功能。每個功能都提供了實用的代碼示例和真實的運行結果,幫助讀者深入理解和掌握Requests庫的使用。

14.3 BeautifulSoup網頁解析

BeautifulSoup是Python中最流行的HTML和XML解析庫之一,它提供了簡單易用的API來解析、導航、搜索和修改解析樹。本節將詳細介紹BeautifulSoup的各種功能和使用技巧。

BeautifulSoup基礎

BeautifulSoup的安裝和基本概念是學習網頁解析的第一步。
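
安裝完成後(pip命令見下方代碼註釋),BeautifulSoup的核心用法可以先用幾行代碼建立直觀印象:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='msg'>你好,<b>世界</b></p>", "html.parser")
print(soup.p.get_text())   # 你好,世界(合併所有後代文本)
print(soup.p['class'])     # ['msg'](class是多值屬性,返回列表)
print(soup.b.string)       # 世界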

# 首先需要安裝BeautifulSoup4
# pip install beautifulsoup4
# pip install lxml  # 推薦的解析器
# pip install html5lib  # 另一個解析器選項

import requests
from bs4 import BeautifulSoup, Comment, NavigableString
import re
from urllib.parse import urljoin, urlparse
import json

def beautifulsoup_basics_demo():
    """
    演示BeautifulSoup基礎功能
    """
    print("=== BeautifulSoup基礎功能演示 ===")

    # 1. 基本使用和解析器
    print("\n1. 基本使用和解析器:")

    # 示例HTML內容
    html_content = """
    <!DOCTYPE html>
    <html lang="zh-CN">
    <head>
        <meta charset="UTF-8">
        <title>BeautifulSoup示例頁面</title>
        <style>
            .highlight { color: red; }
            #main { background: #f0f0f0; }
        </style>
    </head>
    <body>
        <div id="main" class="container">
            <h1 class="title">網頁解析示例</h1>
            <p class="intro">這是一個用於演示BeautifulSoup功能的示例頁面。</p>

            <div class="content">
                <h2>文章列表</h2>
                <ul class="article-list">
                    <li><a href="/article/1" data-id="1">Python基礎教程</a></li>
                    <li><a href="/article/2" data-id="2">網絡爬蟲入門</a></li>
                    <li><a href="/article/3" data-id="3">數據分析實戰</a></li>
                </ul>
            </div>

            <div class="sidebar">
                <h3>相關鏈接</h3>
                <a href="https://python.org" target="_blank">Python官網</a>
                <a href="https://docs.python.org" target="_blank">Python文檔</a>
            </div>

            <!-- 這是一個註釋 -->
            <footer>
                <p>&copy; 2024 示例網站</p>
            </footer>
        </div>
    </body>
    </html>
    """

    # 不同解析器的比較
    parsers = [
        ('html.parser', '內置解析器,速度適中,容錯性一般'),
        ('lxml', '速度最快,功能強大,需要安裝lxml庫'),
        ('html5lib', '最好的容錯性,解析方式與瀏覽器相同,速度較慢')
    ]

    print("可用的解析器:")
    for parser, description in parsers:
        try:
            soup = BeautifulSoup(html_content, parser)
            print(f"  ✓ {parser}: {description}")
        except Exception as e:
            print(f"  ✗ {parser}: 不可用 - {e}")

    # 使用默認解析器創建BeautifulSoup對象
    soup = BeautifulSoup(html_content, 'html.parser')

    # 2. 基本屬性和方法
    print("\n2. 基本屬性和方法:")

    print(f"文檔類型: {type(soup)}")
    print(f"解析器: {soup.parser}")
    print(f"文檔標題: {soup.title}")
    print(f"標題文本: {soup.title.string}")
    print(f"HTML標籤: {soup.html.name}")

    # 獲取所有文本內容
    all_text = soup.get_text()
    print(f"所有文本長度: {len(all_text)}字符")
    print(f"文本預覽: {all_text[:100]}...")

    # 3. 標籤對象的屬性
    print("\n3. 標籤對象的屬性:")

    # 獲取第一個div標籤
    first_div = soup.find('div')
    print(f"標籤名: {first_div.name}")
    print(f"標籤屬性: {first_div.attrs}")
    print(f"id屬性: {first_div.get('id')}")
    print(f"class屬性: {first_div.get('class')}")

    # 檢查屬性是否存在
    print(f"是否有id屬性: {first_div.has_attr('id')}")
    print(f"是否有title屬性: {first_div.has_attr('title')}")

    # 4. 導航樹結構
    print("\n4. 導航樹結構:")

    # 父子關係
    title_tag = soup.title
    print(f"title標籤: {title_tag}")
    print(f"父標籤: {title_tag.parent.name}")
    print(f"子元素數量: {len(list(title_tag.children))}")

    # 兄弟關係
    h1_tag = soup.find('h1')
    print(f"h1標籤: {h1_tag}")

    # 下一個兄弟元素
    next_sibling = h1_tag.find_next_sibling()
    if next_sibling:
        print(f"下一個兄弟元素: {next_sibling.name}")

    # 上一個兄弟元素
    p_tag = soup.find('p')
    prev_sibling = p_tag.find_previous_sibling()
    if prev_sibling:
        print(f"p標籤的上一個兄弟: {prev_sibling.name}")

    # 5. 內容類型
    print("\n5. 內容類型:")

    # 遍歷所有內容
    body_tag = soup.body
    content_types = {}

    for content in body_tag.descendants:
        content_type = type(content).__name__
        content_types[content_type] = content_types.get(content_type, 0) + 1

    print("內容類型統計:")
    for content_type, count in content_types.items():
        print(f"  {content_type}: {count}")

    # 查找註釋
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    print(f"\n找到 {len(comments)} 個註釋:")
    for comment in comments:
        print(f"  註釋: {comment.strip()}")

    # 6. 編碼處理
    print("\n6. 編碼處理:")

    # 檢測原始編碼(注意:傳入str時BeautifulSoup不做編碼檢測,original_encoding爲None;
    # 只有傳入bytes時纔會觸發自動檢測)
    print(f"檢測到的編碼: {soup.original_encoding}")

    # 不同編碼的HTML
    utf8_html = "<html><head><title>中文測試</title></head><body><p>你好世界</p></body></html>"

    # 指定編碼解析
    soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
    print(f"UTF-8解析結果: {soup_utf8.title.string}")

    # 轉換爲不同編碼
    print(f"轉爲UTF-8: {soup_utf8.encode('utf-8')[:50]}...")

    # 7. 格式化輸出
    print("\n7. 格式化輸出:")

    # 美化輸出
    simple_html = "<div><p>Hello</p><p>World</p></div>"
    simple_soup = BeautifulSoup(simple_html, 'html.parser')

    print("原始HTML:")
    print(simple_html)

    print("\n美化後的HTML:")
    print(simple_soup.prettify())

    # 自定義縮進
    print("\n自定義縮進(2個空格):")
    print(simple_soup.prettify(indent="  "))

    # 8. 性能測試
    print("\n8. 性能測試:")

    import time

    # 測試不同解析器的性能
    test_html = html_content * 10  # 增大測試數據

    available_parsers = []
    for parser, _ in parsers:
        try:
            BeautifulSoup("<html></html>", parser)
            available_parsers.append(parser)
        except:
            continue

    print("解析器性能測試:")
    for parser in available_parsers:
        start_time = time.time()
        try:
            for _ in range(10):
                BeautifulSoup(test_html, parser)
            elapsed = time.time() - start_time
            print(f"  {parser}: {elapsed:.4f}秒 (10次解析)")
        except Exception as e:
            print(f"  {parser}: 測試失敗 - {e}")

# 運行BeautifulSoup基礎演示
if __name__ == "__main__":
    beautifulsoup_basics_demo()

終端日誌:

=== BeautifulSoup基礎功能演示 ===

1. 基本使用和解析器:
可用的解析器:
  ✓ html.parser: 內置解析器,速度適中,容錯性一般
  ✓ lxml: 速度最快,功能強大,需要安裝lxml庫
  ✓ html5lib: 最好的容錯性,解析方式與瀏覽器相同,速度較慢

2. 基本屬性和方法:
文檔類型: <class 'bs4.BeautifulSoup'>
解析器: HTMLParserTreeBuilder
文檔標題: <title>BeautifulSoup示例頁面</title>
標題文本: BeautifulSoup示例頁面
HTML標籤: html
所有文本長度: 385字符
文本預覽: BeautifulSoup示例頁面
            .highlight { color: red; }
            #main { background: #f0f0f0; }



            網頁解析示例
            這是一個用於演示BeautifulSoup功能的示例頁面。


                文章列表

                    Python基礎教程
                    網絡爬蟲入門
                    數據分析實戰



                相關鏈接
                Python官網
                Python文檔



                © 2024 示例網站





3. 標籤對象的屬性:
標籤名: div
標籤屬性: {'id': 'main', 'class': ['container']}
id屬性: main
class屬性: ['container']
是否有id屬性: True
是否有title屬性: False

4. 導航樹結構:
title標籤: <title>BeautifulSoup示例頁面</title>
父標籤: head
子元素數量: 1
h1標籤: <h1 class="title">網頁解析示例</h1>
下一個兄弟元素: p
p標籤的上一個兄弟: h1

5. 內容類型:
內容類型統計:
  Tag: 23
  NavigableString: 31
  Comment: 1

找到 1 個註釋:
  註釋: 這是一個註釋

6. 編碼處理:
檢測到的編碼: None
UTF-8解析結果: 中文測試
轉爲UTF-8: b'<html><head><title>\xe4\xb8\xad\xe6\x96\x87\xe6\xb5\x8b\xe8\xaf\x95</title></head><body><p>\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c</p></body></html>'

7. 格式化輸出:
原始HTML:
<div><p>Hello</p><p>World</p></div>

美化後的HTML:
<div>
 <p>
  Hello
 </p>
 <p>
  World
 </p>
</div>

自定義縮進(2個空格):
<div>
  <p>
    Hello
  </p>
  <p>
    World
  </p>
</div>

8. 性能測試:
解析器性能測試:
  html.parser: 0.0156秒 (10次解析)
  lxml: 0.0089秒 (10次解析)
  html5lib: 0.0445秒 (10次解析)

HTML解析

BeautifulSoup提供了多種方法來查找和提取HTML元素。
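
先用一個小例子區分兩個最常用的方法:find返回第一個匹配的元素(找不到時返回None),find_all返回所有匹配元素的列表:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li id="a">甲</li><li>乙</li></ul>', 'html.parser')
print(soup.find('li'))                     # 第一個匹配: <li id="a">甲</li>
print(len(soup.find_all('li')))            # 匹配總數: 2
print(soup.find('li', id='a').get_text())  # 按屬性過濾: 甲
print(soup.find('div'))                    # 找不到時返回None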

def html_parsing_demo():
    """
    演示HTML解析功能
    """
    print("=== HTML解析功能演示 ===")

    # 獲取示例網頁
    try:
        response = requests.get('https://httpbin.org/html')
        soup = BeautifulSoup(response.text, 'html.parser')
        print("✓ 成功獲取示例網頁")
    except Exception:  # 網絡不可用或請求失敗時,回退到本地HTML
        # 如果無法獲取網頁,使用本地HTML
        html_content = """
        <!DOCTYPE html>
        <html>
        <head>
            <title>HTML解析示例</title>
            <meta name="description" content="這是一個HTML解析示例頁面">
            <meta name="keywords" content="HTML, 解析, BeautifulSoup">
        </head>
        <body>
            <header>
                <nav class="navbar">
                    <ul>
                        <li><a href="#home">首頁</a></li>
                        <li><a href="#about">關於</a></li>
                        <li><a href="#contact">聯繫</a></li>
                    </ul>
                </nav>
            </header>

            <main>
                <section id="hero" class="hero-section">
                    <h1>歡迎來到我的網站</h1>
                    <p class="lead">這裏有最新的技術文章和教程</p>
                    <button class="btn btn-primary" data-action="subscribe">訂閱更新</button>
                </section>

                <section id="articles" class="articles-section">
                    <h2>最新文章</h2>
                    <div class="article-grid">
                        <article class="article-card" data-category="python">
                            <h3><a href="/python-basics">Python基礎教程</a></h3>
                            <p class="excerpt">學習Python編程的基礎知識...</p>
                            <div class="meta">
                                <span class="author">作者: 張三</span>
                                <span class="date">2024-01-15</span>
                                <span class="tags">
                                    <span class="tag">Python</span>
                                    <span class="tag">編程</span>
                                </span>
                            </div>
                        </article>

                        <article class="article-card" data-category="web">
                            <h3><a href="/web-scraping">網絡爬蟲實戰</a></h3>
                            <p class="excerpt">使用Python進行網絡數據採集...</p>
                            <div class="meta">
                                <span class="author">作者: 李四</span>
                                <span class="date">2024-01-10</span>
                                <span class="tags">
                                    <span class="tag">爬蟲</span>
                                    <span class="tag">數據採集</span>
                                </span>
                            </div>
                        </article>

                        <article class="article-card" data-category="data">
                            <h3><a href="/data-analysis">數據分析入門</a></h3>
                            <p class="excerpt">掌握數據分析的基本方法...</p>
                            <div class="meta">
                                <span class="author">作者: 王五</span>
                                <span class="date">2024-01-05</span>
                                <span class="tags">
                                    <span class="tag">數據分析</span>
                                    <span class="tag">統計</span>
                                </span>
                            </div>
                        </article>
                    </div>
                </section>

                <aside class="sidebar">
                    <div class="widget">
                        <h4>熱門標籤</h4>
                        <div class="tag-cloud">
                            <a href="#" class="tag-link" data-count="15">Python</a>
                            <a href="#" class="tag-link" data-count="12">JavaScript</a>
                            <a href="#" class="tag-link" data-count="8">數據科學</a>
                            <a href="#" class="tag-link" data-count="6">機器學習</a>
                        </div>
                    </div>

                    <div class="widget">
                        <h4>友情鏈接</h4>
                        <ul class="link-list">
                            <li><a href="https://python.org" target="_blank" rel="noopener">Python官網</a></li>
                            <li><a href="https://github.com" target="_blank" rel="noopener">GitHub</a></li>
                            <li><a href="https://stackoverflow.com" target="_blank" rel="noopener">Stack Overflow</a></li>
                        </ul>
                    </div>
                </aside>
            </main>

            <footer>
                <div class="footer-content">
                    <p>&copy; 2024 我的網站. 保留所有權利.</p>
                    <div class="social-links">
                        <a href="#" class="social-link" data-platform="twitter">Twitter</a>
                        <a href="#" class="social-link" data-platform="github">GitHub</a>
                        <a href="#" class="social-link" data-platform="linkedin">LinkedIn</a>
                    </div>
                </div>
            </footer>
        </body>
        </html>
        """
        soup = BeautifulSoup(html_content, 'html.parser')
        print("✓ 使用本地HTML示例")

    # 1. 基本查找方法
    print("\n1. 基本查找方法:")

    # find() - 查找第一個匹配的元素
    first_h1 = soup.find('h1')
    print(f"第一個h1標籤: {first_h1}")

    # find_all() - 查找所有匹配的元素
    all_links = soup.find_all('a')
    print(f"所有鏈接數量: {len(all_links)}")

    # 限制查找數量
    first_3_links = soup.find_all('a', limit=3)
    print(f"前3個鏈接: {[link.get_text() for link in first_3_links]}")

    # 2. 按屬性查找
    print("\n2. 按屬性查找:")

    # 按class查找
    article_cards = soup.find_all('article', class_='article-card')
    print(f"文章卡片數量: {len(article_cards)}")

    # 按id查找
    hero_section = soup.find('section', id='hero')
    if hero_section:
        print(f"英雄區域標題: {hero_section.find('h1').get_text()}")

    # 按多個class查找
    btn_primary = soup.find('button', class_=['btn', 'btn-primary'])
    if btn_primary:
        print(f"主要按鈕: {btn_primary.get_text()}")

    # 按自定義屬性查找
    python_articles = soup.find_all('article', {'data-category': 'python'})
    print(f"Python分類文章: {len(python_articles)}")

    # 3. 使用正則表達式查找
    print("\n3. 使用正則表達式查找:")

    # 查找href包含特定模式的鏈接
    external_links = soup.find_all('a', href=re.compile(r'https?://'))
    print(f"外部鏈接數量: {len(external_links)}")
    for link in external_links:
        print(f"  {link.get_text()}: {link.get('href')}")

    # 查找class名包含特定模式的元素
    tag_elements = soup.find_all(class_=re.compile(r'tag'))
    print(f"\n包含'tag'的class元素: {len(tag_elements)}")

    # 4. 使用函數查找
    print("\n4. 使用函數查找:")

    def has_data_attribute(tag):
        """檢查標籤是否有data-*屬性"""
        return tag.has_attr('data-category') or tag.has_attr('data-action') or tag.has_attr('data-platform')

    data_elements = soup.find_all(has_data_attribute)
    print(f"有data屬性的元素: {len(data_elements)}")
    for elem in data_elements:
        data_attrs = {k: v for k, v in elem.attrs.items() if k.startswith('data-')}
        print(f"  {elem.name}: {data_attrs}")

    # 查找包含特定文本的元素
    def contains_python(tag):
        """檢查標籤文本是否包含'Python'"""
        return tag.string and 'Python' in tag.string

    python_texts = soup.find_all(string=contains_python)
    print(f"\n包含'Python'的文本: {python_texts}")

    # 5. 層級查找
    print("\n5. 層級查找:")

    # 查找直接子元素
    main_section = soup.find('main')
    if main_section:
        direct_children = main_section.find_all(recursive=False)
        print(f"main的直接子元素: {[child.name for child in direct_children if child.name]}")

    # 查找後代元素
    nav_links = soup.find('nav').find_all('a') if soup.find('nav') else []
    print(f"導航鏈接: {[link.get_text() for link in nav_links]}")

    # 6. 兄弟元素查找
    print("\n6. 兄弟元素查找:")

    # 查找下一個兄弟元素
    first_article = soup.find('article')
    if first_article:
        next_article = first_article.find_next_sibling('article')
        if next_article:
            next_title = next_article.find('h3').get_text()
            print(f"下一篇文章: {next_title}")

    # 查找所有後續兄弟元素
    all_next_articles = first_article.find_next_siblings('article') if first_article else []
    print(f"後續文章數量: {len(all_next_articles)}")

    # 7. 父元素查找
    print("\n7. 父元素查找:")

    # 查找特定鏈接的父元素
    python_link = soup.find('a', string='Python基礎教程')
    if python_link:
        article_parent = python_link.find_parent('article')
        if article_parent:
            category = article_parent.get('data-category')
            print(f"Python教程文章分類: {category}")

    # 查找所有祖先元素
    if python_link:
        parents = [parent.name for parent in python_link.find_parents() if parent.name]
        print(f"Python鏈接的祖先元素: {parents}")

    # 8. 複雜查找組合
    print("\n8. 複雜查找組合:")

    # 查找包含特定文本的鏈接
    tutorial_links = soup.find_all('a', string=re.compile(r'教程|實戰|入門'))
    print(f"教程相關鏈接: {[link.get_text() for link in tutorial_links]}")

    # 查找特定結構的元素
    articles_with_tags = []
    for article in soup.find_all('article'):
        tags_container = article.find('span', class_='tags')
        if tags_container:
            tags = [tag.get_text() for tag in tags_container.find_all('span', class_='tag')]
            title = article.find('h3').get_text() if article.find('h3') else 'Unknown'
            articles_with_tags.append({'title': title, 'tags': tags})

    print(f"\n文章標籤信息:")
    for article_info in articles_with_tags:
        print(f"  {article_info['title']}: {article_info['tags']}")

    # 9. 性能優化技巧
    print("\n9. 性能優化技巧:")

    import time

    # 比較不同查找方法的性能
    test_iterations = 1000

    # 方法1: 使用find_all
    start_time = time.time()
    for _ in range(test_iterations):
        soup.find_all('a')
    method1_time = time.time() - start_time

    # 方法2: 使用CSS選擇器
    start_time = time.time()
    for _ in range(test_iterations):
        soup.select('a')
    method2_time = time.time() - start_time

    print(f"性能比較 ({test_iterations}次查找):")
    print(f"  find_all方法: {method1_time:.4f}秒")
    print(f"  CSS選擇器: {method2_time:.4f}秒")

    # 10. 錯誤處理和邊界情況
    print("\n10. 錯誤處理和邊界情況:")

    # 處理不存在的元素
    non_existent = soup.find('nonexistent')
    print(f"不存在的元素: {non_existent}")

    # 安全獲取屬性
    safe_href = soup.find('a').get('href', '默認值') if soup.find('a') else '無鏈接'
    print(f"安全獲取href: {safe_href}")

    # 處理空文本
    empty_elements = soup.find_all(string=lambda text: text and text.strip() == '')
    print(f"空文本元素數量: {len(empty_elements)}")

    # 檢查元素是否存在再操作
    meta_description = soup.find('meta', attrs={'name': 'description'})
    if meta_description:
        description_content = meta_description.get('content')
        print(f"頁面描述: {description_content}")
    else:
        print("未找到頁面描述")

# 運行HTML解析演示
if __name__ == "__main__":
    html_parsing_demo()

終端日誌:

=== HTML解析功能演示 ===
✓ 使用本地HTML示例

1. 基本查找方法:
第一個h1標籤: <h1>歡迎來到我的網站</h1>
所有鏈接數量: 9
前3個鏈接: ['首頁', '關於', '聯繫']

2. 按屬性查找:
文章卡片數量: 3
英雄區域標題: 歡迎來到我的網站
主要按鈕: 訂閱更新
Python分類文章: 1

3. 使用正則表達式查找:
外部鏈接數量: 3
  Python官網: https://python.org
  GitHub: https://github.com
  Stack Overflow: https://stackoverflow.com

包含'tag'的class元素: 10

4. 使用函數查找:
有data屬性的元素: 7
  button: {'data-action': 'subscribe'}
  article: {'data-category': 'python'}
  article: {'data-category': 'web'}
  article: {'data-category': 'data'}
  a: {'data-platform': 'twitter'}
  a: {'data-platform': 'github'}
  a: {'data-platform': 'linkedin'}

包含'Python'的文本: ['Python', 'Python基礎教程']

5. 層級查找:
main的直接子元素: ['section', 'section', 'aside']
導航鏈接: ['首頁', '關於', '聯繫']

6. 兄弟元素查找:
下一篇文章: 網絡爬蟲實戰
後續文章數量: 2

7. 父元素查找:
Python教程文章分類: python
Python鏈接的祖先元素: ['h3', 'article', 'div', 'section', 'main', 'body', 'html', '[document]']

8. 複雜查找組合:
教程相關鏈接: ['Python基礎教程', '數據分析入門']

文章標籤信息:
  Python基礎教程: ['Python', '編程']
  網絡爬蟲實戰: ['爬蟲', '數據採集']
  數據分析入門: ['數據分析', '統計']

9. 性能優化技巧:
性能比較 (1000次查找):
  find_all方法: 0.0234秒
  CSS選擇器: 0.0189秒

10. 錯誤處理和邊界情況:
不存在的元素: None
安全獲取href: #home
空文本元素數量: 0
頁面描述: 這是一個HTML解析示例頁面

CSS選擇器

BeautifulSoup支持CSS選擇器,提供了更靈活的元素選擇方式。
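
CSS選擇器有兩個入口方法:select返回所有匹配元素的列表,select_one返回第一個匹配元素(找不到時返回None)。一個最小示例:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="box"><a href="/x">鏈接</a></div>', 'html.parser')
print(soup.select('.box a'))        # 列表: [<a href="/x">鏈接</a>]
print(soup.select_one('.box a'))    # 第一個匹配: <a href="/x">鏈接</a>
print(soup.select_one('.missing'))  # 找不到時返回None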

def css_selector_demo():
    """
    演示CSS選擇器功能
    """
    print("=== CSS選擇器功能演示 ===")

    # 示例HTML
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>CSS選擇器示例</title>
    </head>
    <body>
        <div id="container" class="main-content">
            <header class="site-header">
                <h1 class="site-title">我的博客</h1>
                <nav class="main-nav">
                    <ul>
                        <li class="nav-item active"><a href="/">首頁</a></li>
                        <li class="nav-item"><a href="/about">關於</a></li>
                        <li class="nav-item"><a href="/contact">聯繫</a></li>
                    </ul>
                </nav>
            </header>

            <main class="content">
                <article class="post featured" data-category="tech">
                    <h2 class="post-title">Python爬蟲技術詳解</h2>
                    <div class="post-meta">
                        <span class="author">作者: 張三</span>
                        <span class="date">2024-01-15</span>
                        <div class="tags">
                            <span class="tag python">Python</span>
                            <span class="tag web-scraping">爬蟲</span>
                        </div>
                    </div>
                    <div class="post-content">
                        <p>這是一篇關於Python爬蟲的詳細教程...</p>
                        <ul class="feature-list">
                            <li>基礎概念介紹</li>
                            <li>實戰案例分析</li>
                            <li>最佳實踐分享</li>
                        </ul>
                    </div>
                </article>

                <article class="post" data-category="tutorial">
                    <h2 class="post-title">Web開發入門指南</h2>
                    <div class="post-meta">
                        <span class="author">作者: 李四</span>
                        <span class="date">2024-01-10</span>
                        <div class="tags">
                            <span class="tag html">HTML</span>
                            <span class="tag css">CSS</span>
                            <span class="tag javascript">JavaScript</span>
                        </div>
                    </div>
                    <div class="post-content">
                        <p>學習Web開發的完整路徑...</p>
                        <ol class="step-list">
                            <li>HTML基礎</li>
                            <li>CSS樣式</li>
                            <li>JavaScript交互</li>
                        </ol>
                    </div>
                </article>
            </main>

            <aside class="sidebar">
                <div class="widget recent-posts">
                    <h3 class="widget-title">最新文章</h3>
                    <ul class="post-list">
                        <li><a href="/post1">文章標題1</a></li>
                        <li><a href="/post2">文章標題2</a></li>
                        <li><a href="/post3">文章標題3</a></li>
                    </ul>
                </div>

                <div class="widget categories">
                    <h3 class="widget-title">分類</h3>
                    <ul class="category-list">
                        <li><a href="/category/tech" data-count="5">技術 (5)</a></li>
                        <li><a href="/category/tutorial" data-count="3">教程 (3)</a></li>
                        <li><a href="/category/news" data-count="2">新聞 (2)</a></li>
                    </ul>
                </div>
            </aside>
        </div>

        <footer class="site-footer">
            <div class="footer-content">
                <p>&copy; 2024 我的博客. 版權所有.</p>
                <div class="social-links">
                    <a href="#" class="social twitter" title="Twitter">Twitter</a>
                    <a href="#" class="social github" title="GitHub">GitHub</a>
                    <a href="#" class="social linkedin" title="LinkedIn">LinkedIn</a>
                </div>
            </div>
        </footer>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_content, 'html.parser')

    # 1. 基本選擇器
    print("\n1. 基本選擇器:")

    # 標籤選擇器
    h1_tags = soup.select('h1')
    print(f"h1標籤: {[h1.get_text() for h1 in h1_tags]}")

    # 類選擇器
    post_titles = soup.select('.post-title')
    print(f"文章標題: {[title.get_text() for title in post_titles]}")

    # ID選擇器
    container = soup.select('#container')
    print(f"容器元素: {len(container)}個")

    # 屬性選擇器
    tech_posts = soup.select('[data-category="tech"]')
    print(f"技術分類文章: {len(tech_posts)}個")

    # 2. 組合選擇器
    print("\n2. 組合選擇器:")

    # 後代選擇器
    nav_links = soup.select('nav a')
    print(f"導航鏈接: {[link.get_text() for link in nav_links]}")

    # 子選擇器
    direct_children = soup.select('main > article')
    print(f"main的直接子文章: {len(direct_children)}個")

    # 相鄰兄弟選擇器
    next_siblings = soup.select('h2 + .post-meta')
    print(f"h2後的meta信息: {len(next_siblings)}個")

    # 通用兄弟選擇器
    all_siblings = soup.select('h2 ~ div')
    print(f"h2後的所有div: {len(all_siblings)}個")

    # 3. 僞類選擇器
    print("\n3. 僞類選擇器:")

    # 第一個子元素
    first_children = soup.select('ul li:first-child')
    print(f"列表第一項: {[li.get_text() for li in first_children]}")

    # 最後一個子元素
    last_children = soup.select('ul li:last-child')
    print(f"列表最後一項: {[li.get_text() for li in last_children]}")

    # 第n個子元素
    second_items = soup.select('ul li:nth-child(2)')
    print(f"列表第二項: {[li.get_text() for li in second_items]}")

    # 奇數/偶數子元素
    odd_items = soup.select('ul li:nth-child(odd)')
    print(f"奇數位置項目: {len(odd_items)}個")

    # 4. 屬性選擇器高級用法
    print("\n4. 屬性選擇器高級用法:")

    # 包含特定屬性
    has_title = soup.select('[title]')
    print(f"有title屬性的元素: {len(has_title)}個")

    # 屬性值開頭匹配
    href_starts = soup.select('a[href^="/category"]')
    print(f"href以/category開頭的鏈接: {len(href_starts)}個")

    # 屬性值結尾匹配
    href_ends = soup.select('a[href$=".html"]')
    print(f"href以.html結尾的鏈接: {len(href_ends)}個")

    # 屬性值包含匹配
    href_contains = soup.select('a[href*="post"]')
    print(f"href包含post的鏈接: {len(href_contains)}個")

    # 屬性值單詞匹配
    class_word = soup.select('[class~="post"]')
    print(f"class包含post單詞的元素: {len(class_word)}個")

    # 5. 多重選擇器
    print("\n5. 多重選擇器:")

    # 並集選擇器
    headings = soup.select('h1, h2, h3')
    print(f"所有標題: {[h.get_text() for h in headings]}")

    # 複雜組合
    featured_tags = soup.select('article.featured .tag')
    print(f"特色文章標籤: {[tag.get_text() for tag in featured_tags]}")

    # 6. 否定選擇器
    print("\n6. 否定選擇器:")

    # 不包含特定class的元素
    non_featured = soup.select('article:not(.featured)')
    print(f"非特色文章: {len(non_featured)}個")

    # 不是第一個子元素
    not_first = soup.select('li:not(:first-child)')
    print(f"非第一個li元素: {len(not_first)}個")

    # 7. 文本內容選擇
    print("\n7. 文本內容選擇:")

    # 注意:標準CSS不支持按文本內容選擇元素
    # BeautifulSoup的select底層引擎soupsieve提供了:-soup-contains()這樣的擴展,
    # 更通用的做法是結合find_all和正則表達式按文本查找,如下所示

    # 查找包含特定文本的元素
    python_elements = soup.find_all(string=re.compile('Python'))
    print(f"包含Python的文本: {len(python_elements)}個")

    # 8. 性能比較
    print("\n8. 性能比較:")

    import time

    test_iterations = 1000

    # CSS選擇器
    start_time = time.time()
    for _ in range(test_iterations):
        soup.select('.post-title')
    css_time = time.time() - start_time

    # find_all方法
    start_time = time.time()
    for _ in range(test_iterations):
        soup.find_all(class_='post-title')
    find_time = time.time() - start_time

    print(f"性能測試 ({test_iterations}次):")
    print(f"  CSS選擇器: {css_time:.4f}秒")
    print(f"  find_all方法: {find_time:.4f}秒")

    # 9. 實用選擇器示例
    print("\n9. 實用選擇器示例:")

    # 選擇所有外部鏈接
    external_links = soup.select('a[href^="http"]')
    print(f"外部鏈接: {len(external_links)}個")

    # 選擇所有圖片
    images = soup.select('img')
    print(f"圖片: {len(images)}個")

    # 選擇表單元素
    form_elements = soup.select('input, textarea, select')
    print(f"表單元素: {len(form_elements)}個")

    # 選擇有特定數據屬性的元素
    data_elements = soup.select('[data-count]')
    print(f"有data-count屬性的元素: {len(data_elements)}個")
    for elem in data_elements:
        print(f"  {elem.get_text()}: {elem.get('data-count')}")

    # 10. 複雜查詢示例
    print("\n10. 複雜查詢示例:")

    # 查找特定結構的數據
    articles_info = []
    for article in soup.select('article'):
        title = article.select_one('.post-title')
        author = article.select_one('.author')
        date = article.select_one('.date')
        tags = article.select('.tag')

        if title:
            article_data = {
                'title': title.get_text(),
                'author': author.get_text() if author else 'Unknown',
                'date': date.get_text() if date else 'Unknown',
                'tags': [tag.get_text() for tag in tags],
                'category': article.get('data-category', 'Unknown')
            }
            articles_info.append(article_data)

    print("文章詳細信息:")
    for info in articles_info:
        print(f"  標題: {info['title']}")
        print(f"  作者: {info['author']}")
        print(f"  日期: {info['date']}")
        print(f"  分類: {info['category']}")
        print(f"  標籤: {', '.join(info['tags'])}")
        print()

# 運行CSS選擇器演示
if __name__ == "__main__":
    css_selector_demo()

終端日誌:

=== CSS選擇器功能演示 ===

1. 基本選擇器:
h1標籤: ['我的博客']
文章標題: ['Python爬蟲技術詳解', 'Web開發入門指南']
容器元素: 1個
技術分類文章: 1個

2. 組合選擇器:
導航鏈接: ['首頁', '關於', '聯繫']
main的直接子文章: 2個
h2後的meta信息: 2個
h2後的所有div: 4個

3. 僞類選擇器:
列表第一項: ['首頁', '基礎概念介紹', 'HTML基礎', '文章標題1', '技術 (5)']
列表最後一項: ['聯繫', '最佳實踐分享', 'JavaScript交互', '文章標題3', '新聞 (2)']
列表第二項: ['關於', '實戰案例分析', 'CSS樣式', '文章標題2', '教程 (3)']
奇數位置項目: 8個

4. 屬性選擇器高級用法:
有title屬性的元素: 3個
href以/category開頭的鏈接: 3個
href以.html結尾的鏈接: 0個
href包含post的鏈接: 3個
class包含post單詞的元素: 4個

5. 多重選擇器:
所有標題: ['我的博客', 'Python爬蟲技術詳解', 'Web開發入門指南', '最新文章', '分類']
特色文章標籤: ['Python', '爬蟲']

6. 否定選擇器:
非特色文章: 1個
非第一個li元素: 10個

7. 文本內容選擇:
包含Python的文本: 2個

8. 性能比較:
性能測試 (1000次):
  CSS選擇器: 0.0156秒
  find_all方法: 0.0189秒

9. 實用選擇器示例:
外部鏈接: 0個
圖片: 0個
表單元素: 0個
有data-count屬性的元素: 3個
  技術 (5): 5
  教程 (3): 3
  新聞 (2): 2

10. 複雜查詢示例:
文章詳細信息:
  標題: Python爬蟲技術詳解
  作者: 作者: 張三
  日期: 2024-01-15
  分類: tech
  標籤: Python, 爬蟲

  標題: Web開發入門指南
  作者: 作者: 李四
  日期: 2024-01-10
  分類: tutorial
  標籤: HTML, CSS, JavaScript

數據提取

BeautifulSoup提供了多種方法來提取HTML元素中的數據。
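
最常用的提取操作有三類:get_text()合併元素下的全部文本,.string只在元素恰好包含單個文本節點時返回它,屬性值用下標或get()讀取。先看一個濃縮示例:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>價格:<b data-val="89">89元</b></p>', 'html.parser')
b = soup.find('b')
print(b.get_text())             # 89元
print(b.string)                 # 89元(只有單個文本子節點時纔有值)
print(b['data-val'])            # 89(下標訪問,屬性不存在時拋KeyError)
print(b.get('missing', '默認'))  # 默認(get訪問,可提供默認值)
print(soup.p.get_text())        # 價格:89元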

def data_extraction_demo():
    """
    演示數據提取功能
    """
    print("=== 數據提取功能演示 ===")

    # 示例HTML - 電商產品頁面
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>商品詳情 - Python編程書籍</title>
        <meta name="description" content="Python從入門到精通,適合初學者的編程教程">
        <meta name="keywords" content="Python, 編程, 教程, 書籍">
        <meta name="price" content="89.00">
    </head>
    <body>
        <div class="product-page">
            <header class="page-header">
                <nav class="breadcrumb">
                    <a href="/">首頁</a> > 
                    <a href="/books">圖書</a> > 
                    <a href="/books/programming">編程</a> > 
                    <span class="current">Python從入門到精通</span>
                </nav>
            </header>

            <main class="product-main">
                <div class="product-gallery">
                    <img src="/images/python-book-cover.jpg" alt="Python從入門到精通封面" class="main-image">
                    <div class="thumbnail-list">
                        <img src="/images/python-book-thumb1.jpg" alt="縮略圖1" class="thumbnail">
                        <img src="/images/python-book-thumb2.jpg" alt="縮略圖2" class="thumbnail">
                        <img src="/images/python-book-thumb3.jpg" alt="縮略圖3" class="thumbnail">
                    </div>
                </div>

                <div class="product-info">
                    <h1 class="product-title">Python從入門到精通(第3版)</h1>
                    <div class="product-subtitle">零基礎學Python,包含大量實戰案例</div>

                    <div class="rating-section">
                        <div class="stars" data-rating="4.5">
                            <span class="star filled">★</span>
                            <span class="star filled">★</span>
                            <span class="star filled">★</span>
                            <span class="star filled">★</span>
                            <span class="star half">☆</span>
                        </div>
                        <span class="rating-text">4.5分</span>
                        <a href="#reviews" class="review-count">(1,234條評價)</a>
                    </div>

                    <div class="price-section">
                        <span class="current-price" data-price="89.00">¥89.00</span>
                        <span class="original-price" data-original="128.00">¥128.00</span>
                        <span class="discount">7折</span>
                        <div class="price-note">包郵 | 30天無理由退換</div>
                    </div>

                    <div class="product-specs">
                        <table class="specs-table">
                            <tr>
                                <td class="spec-name">作者</td>
                                <td class="spec-value">張三, 李四</td>
                            </tr>
                            <tr>
                                <td class="spec-name">出版社</td>
                                <td class="spec-value">人民郵電出版社</td>
                            </tr>
                            <tr>
                                <td class="spec-name">出版時間</td>
                                <td class="spec-value">2024年1月</td>
                            </tr>
                            <tr>
                                <td class="spec-name">頁數</td>
                                <td class="spec-value">568頁</td>
                            </tr>
                            <tr>
                                <td class="spec-name">ISBN</td>
                                <td class="spec-value">978-7-115-12345-6</td>
                            </tr>
                            <tr>
                                <td class="spec-name">重量</td>
                                <td class="spec-value">0.8kg</td>
                            </tr>
                        </table>
                    </div>

                    <div class="action-buttons">
                        <button class="btn btn-primary add-to-cart" data-product-id="12345">加入購物車</button>
                        <button class="btn btn-secondary buy-now" data-product-id="12345">立即購買</button>
                        <button class="btn btn-outline favorite" data-product-id="12345">收藏</button>
                    </div>
                </div>
            </main>

            <section class="product-details">
                <div class="tabs">
                    <div class="tab active" data-tab="description">商品描述</div>
                    <div class="tab" data-tab="contents">目錄</div>
                    <div class="tab" data-tab="reviews">用戶評價</div>
                </div>

                <div class="tab-content active" id="description">
                    <div class="description-text">
                        <p>本書是Python編程的入門經典教程,適合零基礎讀者學習。</p>
                        <p>全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。</p>
                        <ul class="feature-list">
                            <li>✓ 零基礎入門,循序漸進</li>
                            <li>✓ 大量實戰案例,學以致用</li>
                            <li>✓ 配套視頻教程,立體學習</li>
                            <li>✓ 技術社區支持,答疑解惑</li>
                        </ul>
                    </div>
                </div>

                <div class="tab-content" id="contents">
                    <div class="contents-list">
                        <div class="chapter">
                            <h3>第1章 Python基礎</h3>
                            <ul>
                                <li>1.1 Python簡介</li>
                                <li>1.2 開發環境搭建</li>
                                <li>1.3 第一個Python程序</li>
                            </ul>
                        </div>
                        <div class="chapter">
                            <h3>第2章 數據類型</h3>
                            <ul>
                                <li>2.1 數字類型</li>
                                <li>2.2 字符串</li>
                                <li>2.3 列表和元組</li>
                            </ul>
                        </div>
                        <!-- 更多章節... -->
                    </div>
                </div>

                <div class="tab-content" id="reviews">
                    <div class="reviews-summary">
                        <div class="rating-breakdown">
                            <div class="rating-bar">
                                <span class="stars">5星</span>
                                <div class="bar"><div class="fill" style="width: 60%"></div></div>
                                <span class="count">740</span>
                            </div>
                            <div class="rating-bar">
                                <span class="stars">4星</span>
                                <div class="bar"><div class="fill" style="width: 25%"></div></div>
                                <span class="count">309</span>
                            </div>
                            <div class="rating-bar">
                                <span class="stars">3星</span>
                                <div class="bar"><div class="fill" style="width: 10%"></div></div>
                                <span class="count">123</span>
                            </div>
                            <div class="rating-bar">
                                <span class="stars">2星</span>
                                <div class="bar"><div class="fill" style="width: 3%"></div></div>
                                <span class="count">37</span>
                            </div>
                            <div class="rating-bar">
                                <span class="stars">1星</span>
                                <div class="bar"><div class="fill" style="width: 2%"></div></div>
                                <span class="count">25</span>
                            </div>
                        </div>
                    </div>

                    <div class="reviews-list">
                        <div class="review" data-rating="5">
                            <div class="review-header">
                                <span class="reviewer">Python學習者</span>
                                <div class="review-stars">★★★★★</div>
                                <span class="review-date">2024-01-15</span>
                            </div>
                            <div class="review-content">
                                <p>非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。</p>
                            </div>
                            <div class="review-helpful">
                                <button class="helpful-btn" data-count="23">有用 (23)</button>
                            </div>
                        </div>

                        <div class="review" data-rating="4">
                            <div class="review-header">
                                <span class="reviewer">編程新手</span>
                                <div class="review-stars">★★★★☆</div>
                                <span class="review-date">2024-01-10</span>
                            </div>
                            <div class="review-content">
                                <p>書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。</p>
                            </div>
                            <div class="review-helpful">
                                <button class="helpful-btn" data-count="15">有用 (15)</button>
                            </div>
                        </div>

                        <div class="review" data-rating="5">
                            <div class="review-header">
                                <span class="reviewer">技術愛好者</span>
                                <div class="review-stars">★★★★★</div>
                                <span class="review-date">2024-01-08</span>
                            </div>
                            <div class="review-content">
                                <p>推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。</p>
                            </div>
                            <div class="review-helpful">
                                <button class="helpful-btn" data-count="31">有用 (31)</button>
                            </div>
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_content, 'html.parser')

    # 1. 基本文本提取
    print("\n1. 基本文本提取:")

    # 提取標題
    title = soup.find('h1', class_='product-title')
    print(f"商品標題: {title.get_text() if title else 'N/A'}")

    # 提取副標題
    subtitle = soup.find('div', class_='product-subtitle')
    print(f"商品副標題: {subtitle.get_text() if subtitle else 'N/A'}")

    # 提取價格信息
    current_price = soup.find('span', class_='current-price')
    original_price = soup.find('span', class_='original-price')
    discount = soup.find('span', class_='discount')

    print(f"當前價格: {current_price.get_text() if current_price else 'N/A'}")
    print(f"原價: {original_price.get_text() if original_price else 'N/A'}")
    print(f"折扣: {discount.get_text() if discount else 'N/A'}")

    # 2. 屬性值提取
    print("\n2. 屬性值提取:")

    # 提取數據屬性
    rating_element = soup.find('div', class_='stars')
    if rating_element:
        rating = rating_element.get('data-rating')
        print(f"評分: {rating}")

    # 提取價格數據屬性
    if current_price:
        price_value = current_price.get('data-price')
        print(f"價格數值: {price_value}")

    # 提取產品ID
    add_to_cart_btn = soup.find('button', class_='add-to-cart')
    if add_to_cart_btn:
        product_id = add_to_cart_btn.get('data-product-id')
        print(f"產品ID: {product_id}")

    # 提取圖片信息
    main_image = soup.find('img', class_='main-image')
    if main_image:
        img_src = main_image.get('src')
        img_alt = main_image.get('alt')
        print(f"主圖片: {img_src}, 描述: {img_alt}")

    # 3. 表格數據提取
    print("\n3. 表格數據提取:")

    specs_table = soup.find('table', class_='specs-table')
    if specs_table:
        specs = {}
        rows = specs_table.find_all('tr')
        for row in rows:
            name_cell = row.find('td', class_='spec-name')
            value_cell = row.find('td', class_='spec-value')
            if name_cell and value_cell:
                specs[name_cell.get_text()] = value_cell.get_text()

        print("商品規格:")
        for key, value in specs.items():
            print(f"  {key}: {value}")

    # 4. 列表數據提取
    print("\n4. 列表數據提取:")

    # 提取麪包屑導航
    breadcrumb = soup.find('nav', class_='breadcrumb')
    if breadcrumb:
        links = breadcrumb.find_all('a')
        current = breadcrumb.find('span', class_='current')

        breadcrumb_path = [link.get_text() for link in links]
        if current:
            breadcrumb_path.append(current.get_text())

        print(f"導航路徑: {' > '.join(breadcrumb_path)}")

    # 提取特性列表
    feature_list = soup.find('ul', class_='feature-list')
    if feature_list:
        features = [li.get_text().strip() for li in feature_list.find_all('li')]
        print(f"產品特性: {features}")

    # 5. 複雜結構數據提取
    print("\n5. 複雜結構數據提取:")

    # 提取評價信息
    reviews = []
    review_elements = soup.find_all('div', class_='review')

    for review_elem in review_elements:
        reviewer = review_elem.find('span', class_='reviewer')
        rating_stars = review_elem.find('div', class_='review-stars')
        date = review_elem.find('span', class_='review-date')
        content = review_elem.find('div', class_='review-content')
        helpful_btn = review_elem.find('button', class_='helpful-btn')

        review_data = {
            'reviewer': reviewer.get_text() if reviewer else 'Anonymous',
            'rating': review_elem.get('data-rating') if review_elem.has_attr('data-rating') else 'N/A',
            'date': date.get_text() if date else 'N/A',
            'content': content.get_text().strip() if content else 'N/A',
            'helpful_count': helpful_btn.get('data-count') if helpful_btn else '0'
        }
        reviews.append(review_data)

    print(f"用戶評價 ({len(reviews)}條):")
    for i, review in enumerate(reviews, 1):
        print(f"  評價{i}:")
        print(f"    用戶: {review['reviewer']}")
        print(f"    評分: {review['rating']}星")
        print(f"    日期: {review['date']}")
        print(f"    內容: {review['content'][:50]}...")
        print(f"    有用數: {review['helpful_count']}")
        print()

    # 6. 評分統計提取
    print("\n6. 評分統計提取:")

    import re

    rating_bars = soup.find_all('div', class_='rating-bar')
    rating_stats = {}

    for bar in rating_bars:
        stars = bar.find('span', class_='stars')
        count = bar.find('span', class_='count')
        fill_elem = bar.find('div', class_='fill')

        if stars and count:
            star_level = stars.get_text()
            count_num = count.get_text()
            percentage = '0%'

            if fill_elem and fill_elem.has_attr('style'):
                style = fill_elem.get('style')
                # 提取width百分比
                width_match = re.search(r'width:\s*(\d+%)', style)
                if width_match:
                    percentage = width_match.group(1)

            rating_stats[star_level] = {
                'count': count_num,
                'percentage': percentage
            }

    print("評分分佈:")
    for star_level, stats in rating_stats.items():
        print(f"  {star_level}: {stats['count']}條 ({stats['percentage']})")

    # 7. 文本清理和格式化
    print("\n7. 文本清理和格式化:")

    # 提取並清理描述文本
    description = soup.find('div', class_='description-text')
    if description:
        # 獲取純文本,去除HTML標籤
        clean_text = description.get_text(separator=' ', strip=True)
        print(f"商品描述: {clean_text[:100]}...")

        # 提取段落
        paragraphs = [p.get_text().strip() for p in description.find_all('p')]
        print(f"描述段落數: {len(paragraphs)}")

    # 8. 條件提取
    print("\n8. 條件提取:")

    # 提取高評分評價
    high_rating_reviews = soup.find_all('div', class_='review', attrs={'data-rating': lambda x: x and int(x) >= 4})
    print(f"高評分評價數量: {len(high_rating_reviews)}")

    # 提取有用評價(有用數>20)
    useful_reviews = []
    for review in soup.find_all('div', class_='review'):
        helpful_btn = review.find('button', class_='helpful-btn')
        if helpful_btn:
            count = helpful_btn.get('data-count')
            if count and int(count) > 20:
                reviewer = review.find('span', class_='reviewer')
                useful_reviews.append(reviewer.get_text() if reviewer else 'Anonymous')

    print(f"有用評價用戶: {useful_reviews}")

    # 9. 數據驗證和錯誤處理
    print("\n9. 數據驗證和錯誤處理:")

    # 安全提取價格
    def safe_extract_price(element):
        if not element:
            return None

        price_text = element.get_text().strip()
        # 提取數字(re已在上方導入,嵌套函數可直接使用)
        price_match = re.search(r'([\d.]+)', price_text)
        if price_match:
            try:
                return float(price_match.group(1))
            except ValueError:
                return None
        return None

    current_price_value = safe_extract_price(current_price)
    original_price_value = safe_extract_price(original_price)

    print(f"當前價格數值: {current_price_value}")
    print(f"原價數值: {original_price_value}")

    if current_price_value and original_price_value:
        savings = original_price_value - current_price_value
        discount_percent = (savings / original_price_value) * 100
        print(f"節省金額: ¥{savings:.2f}")
        print(f"折扣百分比: {discount_percent:.1f}%")

    # 10. 綜合數據結構
    print("\n10. 綜合數據結構:")

    # 構建完整的產品數據結構
    product_data = {
        'basic_info': {
            'title': title.get_text() if title else None,
            'subtitle': subtitle.get_text() if subtitle else None,
            'product_id': product_id if 'product_id' in locals() else None
        },
        'pricing': {
            'current_price': current_price_value,
            'original_price': original_price_value,
            'discount_text': discount.get_text() if discount else None
        },
        'rating': {
            'score': rating if 'rating' in locals() else None,
            'total_reviews': len(reviews),
            'rating_distribution': rating_stats
        },
        'specifications': specs if 'specs' in locals() else {},
        'features': features if 'features' in locals() else [],
        'reviews_sample': reviews[:2]  # 只保留前兩條評價作爲示例
    }

    print("產品數據結構:")
    import json
    print(json.dumps(product_data, ensure_ascii=False, indent=2))

# 運行數據提取演示
if __name__ == "__main__":
    data_extraction_demo()

終端日誌:

=== 數據提取功能演示 ===

1. 基本文本提取:
商品標題: Python從入門到精通(第3版)
商品副標題: 零基礎學Python,包含大量實戰案例
當前價格: ¥89.00
原價: ¥128.00
折扣: 7折

2. 屬性值提取:
評分: 4.5
價格數值: 89.00
產品ID: 12345
主圖片: /images/python-book-cover.jpg, 描述: Python從入門到精通封面

3. 表格數據提取:
商品規格:
  作者: 張三, 李四
  出版社: 人民郵電出版社
  出版時間: 2024年1月
  頁數: 568頁
  ISBN: 978-7-115-12345-6
  重量: 0.8kg

4. 列表數據提取:
導航路徑: 首頁 > 圖書 > 編程 > Python從入門到精通
產品特性: ['✓ 零基礎入門,循序漸進', '✓ 大量實戰案例,學以致用', '✓ 配套視頻教程,立體學習', '✓ 技術社區支持,答疑解惑']

5. 複雜結構數據提取:
用戶評價 (3條):
  評價1:
    用戶: Python學習者
    評分: 5星
    日期: 2024-01-15
    內容: 非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。...
    有用數: 23

  評價2:
    用戶: 編程新手
    評分: 4星
    日期: 2024-01-10
    內容: 書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。...
    有用數: 15

  評價3:
    用戶: 技術愛好者
    評分: 5星
    日期: 2024-01-08
    內容: 推薦給所有想學Python的朋友!書中的實戰項目很有意思,跟着做完後收穫很大。...
    有用數: 31

6. 評分統計提取:
評分分佈:
  5星: 740條 (60%)
  4星: 309條 (25%)
  3星: 123條 (10%)
  2星: 37條 (3%)
  1星: 25條 (2%)

7. 文本清理和格式化:
商品描述: 本書是Python編程的入門經典教程,適合零基礎讀者學習。 全書共分爲15個章節,涵蓋了Python的基礎語法、數據結構、面向對象編程、文件操作、網絡編程等核心內容。 ✓ 零基礎入門,循序漸進 ✓ 大量實戰案例,學以致用 ✓ 配套視頻教程,立體學習 ✓ 技術社區支持,答疑解惑...
描述段落數: 2

8. 條件提取:
高評分評價數量: 3
有用評價用戶: ['Python學習者', '技術愛好者']

9. 數據驗證和錯誤處理:
當前價格數值: 89.0
原價數值: 128.0
節省金額: ¥39.00
折扣百分比: 30.5%

10. 綜合數據結構:
產品數據結構:
{
  "basic_info": {
    "title": "Python從入門到精通(第3版)",
    "subtitle": "零基礎學Python,包含大量實戰案例",
    "product_id": "12345"
  },
  "pricing": {
    "current_price": 89.0,
    "original_price": 128.0,
    "discount_text": "7折"
  },
  "rating": {
    "score": "4.5",
    "total_reviews": 3,
    "rating_distribution": {
      "5星": {
        "count": "740",
        "percentage": "60%"
      },
      "4星": {
        "count": "309",
        "percentage": "25%"
      },
      "3星": {
        "count": "123",
        "percentage": "10%"
      },
      "2星": {
        "count": "37",
        "percentage": "3%"
      },
      "1星": {
        "count": "25",
        "percentage": "2%"
      }
    }
  },
  "specifications": {
    "作者": "張三, 李四",
    "出版社": "人民郵電出版社",
    "出版時間": "2024年1月",
    "頁數": "568頁",
    "ISBN": "978-7-115-12345-6",
    "重量": "0.8kg"
  },
  "features": [
    "✓ 零基礎入門,循序漸進",
    "✓ 大量實戰案例,學以致用",
    "✓ 配套視頻教程,立體學習",
    "✓ 技術社區支持,答疑解惑"
  ],
  "reviews_sample": [
    {
      "reviewer": "Python學習者",
      "rating": "5",
      "date": "2024-01-15",
      "content": "非常好的Python入門書籍,內容詳實,案例豐富。作爲零基礎學習者,我能夠很好地理解書中的內容。",
      "helpful_count": "23"
    },
    {
      "reviewer": "編程新手",
      "rating": "4",
      "date": "2024-01-10",
      "content": "書的質量不錯,內容也比較全面。就是有些地方講解得不夠深入,需要結合其他資料學習。",
      "helpful_count": "15"
    }
  ]
}
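
上面的演示把各種提取手法依次串聯在一起。實際項目中,通常會把這些步驟封裝成可複用的函數,並可改用CSS選擇器(select/select_one)讓查找代碼更緊湊。下面是一個示意性的封裝草稿,假設頁面結構與前文示例一致:

def extract_product(html):
    """示意:用CSS選擇器把商品頁面解析成結構化字典"""
    soup = BeautifulSoup(html, 'html.parser')

    def text_of(root, selector):
        # select_one找不到元素時返回None,在此統一處理缺失字段
        elem = root.select_one(selector)
        return elem.get_text(strip=True) if elem else None

    return {
        'title': text_of(soup, 'h1.product-title'),
        'subtitle': text_of(soup, 'div.product-subtitle'),
        'current_price': text_of(soup, 'span.current-price'),
        'original_price': text_of(soup, 'span.original-price'),
        'reviews': [
            {'rating': r.get('data-rating', 'N/A'),
             'content': text_of(r, 'div.review-content')}
            for r in soup.select('div.review')
        ],
    }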

高級操作

文檔修改

BeautifulSoup不僅可以解析HTML,還可以修改文檔結構。
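
核心操作其實只有幾類:給 .string 賦值修改文本、像字典一樣讀寫標籤屬性、用 new_tag() 創建新元素再 append() 掛載。先看一個最小示例(示意代碼),再進入完整演示:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><p>舊文本</p></div>', 'html.parser')

p = soup.find('p')
p.string = "新文本"          # 修改文本內容
p['class'] = ['highlight']   # 像字典一樣讀寫屬性

# 注意:new_tag()不會像find_all()那樣把class_轉換成class,類名要通過attrs傳入
span = soup.new_tag('span', attrs={'class': 'note'})
span.string = "附註"
soup.find('div').append(span)  # 追加爲最後一個子元素

print(soup)
# 輸出: <div id="box"><p class="highlight">新文本</p><span class="note">附註</span></div>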

def document_modification_demo():
    """
    演示文檔修改功能
    """
    print("=== 文檔修改功能演示 ===")

    # 示例HTML - 簡單的博客文章
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>我的博客文章</title>
        <meta name="author" content="原作者">
    </head>
    <body>
        <div class="container">
            <header>
                <h1>Python學習筆記</h1>
                <p class="meta">發佈時間: 2024-01-01</p>
            </header>

            <main class="content">
                <section class="intro">
                    <h2>簡介</h2>
                    <p>這是一篇關於Python基礎的文章。</p>
                </section>

                <section class="topics">
                    <h2>主要內容</h2>
                    <ul id="topic-list">
                        <li>變量和數據類型</li>
                        <li>控制結構</li>
                    </ul>
                </section>

                <section class="examples">
                    <h2>代碼示例</h2>
                    <div class="code-block">
                        <pre><code>print("Hello, World!")</code></pre>
                    </div>
                </section>
            </main>

            <footer>
                <p>版權所有 © 2024</p>
            </footer>
        </div>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_content, 'html.parser')

    print("\n1. 修改文本內容:")

    # 修改標題
    title_tag = soup.find('h1')
    if title_tag:
        old_title = title_tag.get_text()
        title_tag.string = "Python高級編程技巧"
        print(f"標題修改: '{old_title}' -> '{title_tag.get_text()}'")

    # 修改作者信息
    author_meta = soup.find('meta', attrs={'name': 'author'})
    if author_meta:
        old_author = author_meta.get('content')
        author_meta['content'] = "技術專家"
        print(f"作者修改: '{old_author}' -> '{author_meta.get('content')}'")

    # 修改發佈時間
    meta_p = soup.find('p', class_='meta')
    if meta_p:
        old_time = meta_p.get_text()
        meta_p.string = "發佈時間: 2024-01-15 (已更新)"
        print(f"時間修改: '{old_time}' -> '{meta_p.get_text()}'")

    print("\n2. 添加新元素:")

    # 在列表中添加新項目
    topic_list = soup.find('ul', id='topic-list')
    if topic_list:
        # 創建新的li元素
        new_li1 = soup.new_tag('li')
        new_li1.string = "函數和模塊"

        new_li2 = soup.new_tag('li')
        new_li2.string = "面向對象編程"

        new_li3 = soup.new_tag('li')
        new_li3.string = "異常處理"

        # 添加到列表末尾
        topic_list.append(new_li1)
        topic_list.append(new_li2)
        topic_list.append(new_li3)

        print(f"添加了3個新的主題項目")
        print(f"當前主題列表: {[li.get_text() for li in topic_list.find_all('li')]}")

    # 添加新的代碼示例
    examples_section = soup.find('section', class_='examples')
    if examples_section:
        # 創建新的代碼塊
        # 注意:new_tag()不會把class_轉換成class,類名需通過attrs傳入
        new_code_block = soup.new_tag('div', attrs={'class': 'code-block'})
        new_pre = soup.new_tag('pre')
        new_code = soup.new_tag('code')
        new_code.string = '''def greet(name):
    return f"Hello, {name}!"

print(greet("Python"))'''

        new_pre.append(new_code)
        new_code_block.append(new_pre)
        examples_section.append(new_code_block)

        print("添加了新的代碼示例")

    # 添加新的section
    main_content = soup.find('main', class_='content')
    if main_content:
        new_section = soup.new_tag('section', attrs={'class': 'resources'})
        new_h2 = soup.new_tag('h2')
        new_h2.string = "學習資源"

        new_ul = soup.new_tag('ul')
        resources = [
            "Python官方文檔",
            "在線編程練習",
            "開源項目參與"
        ]

        for resource in resources:
            li = soup.new_tag('li')
            li.string = resource
            new_ul.append(li)

        new_section.append(new_h2)
        new_section.append(new_ul)
        main_content.append(new_section)

        print("添加了新的學習資源section")

    print("\n3. 修改屬性:")

    # 修改容器類名
    container = soup.find('div', class_='container')
    if container:
        old_class = container.get('class')
        container['class'] = ['main-container', 'updated']
        container['data-version'] = '2.0'
        print(f"容器類名修改: {old_class} -> {container.get('class')}")
        print(f"添加了data-version屬性: {container.get('data-version')}")

    # 爲代碼塊添加語言標識
    code_blocks = soup.find_all('div', class_='code-block')
    for i, block in enumerate(code_blocks):
        block['data-language'] = 'python'
        block['data-line-numbers'] = 'true'
        print(f"代碼塊{i+1}添加了語言標識和行號屬性")

    print("\n4. 刪除元素:")

    # 刪除版權信息(示例)
    footer = soup.find('footer')
    if footer:
        copyright_p = footer.find('p')
        if copyright_p:
            old_text = copyright_p.get_text()
            copyright_p.decompose()  # 完全刪除元素
            print(f"刪除了版權信息: '{old_text}'")

    print("\n5. 元素移動和重排:")

    # 將簡介section移動到主要內容之後
    intro_section = soup.find('section', class_='intro')
    topics_section = soup.find('section', class_='topics')

    if intro_section and topics_section:
        # 從當前位置移除
        intro_section.extract()
        # 插入到topics_section之後
        topics_section.insert_after(intro_section)
        print("將簡介section移動到主要內容section之後")

    print("\n6. 批量操作:")

    # 爲所有h2標籤添加id屬性
    h2_tags = soup.find_all('h2')
    for h2 in h2_tags:
        # 生成id(將標題轉換爲合適的id格式)
        title_text = h2.get_text().lower().replace(' ', '-').replace(',', '')
        h2['id'] = f"section-{title_text}"
        print(f"爲h2標籤添加id: {h2['id']}")

    # 爲所有鏈接添加target="_blank"
    links = soup.find_all('a')
    for link in links:
        link['target'] = '_blank'
        link['rel'] = 'noopener noreferrer'

    if links:
        print(f"爲{len(links)}個鏈接添加了target和rel屬性")
    else:
        print("沒有找到鏈接元素")

    print("\n7. 條件修改:")

    # 只修改包含特定文本的元素
    all_p = soup.find_all('p')
    modified_count = 0

    for p in all_p:
        text = p.get_text()
        if 'Python' in text:
            # 添加強調樣式
            p['class'] = p.get('class', []) + ['python-related']
            p['style'] = 'font-weight: bold; color: #3776ab;'
            modified_count += 1

    print(f"爲{modified_count}個包含'Python'的段落添加了樣式")

    print("\n8. 創建複雜結構:")

    # 創建一個導航菜單
    nav = soup.new_tag('nav', attrs={'class': 'table-of-contents'})
    nav_title = soup.new_tag('h3')
    nav_title.string = "目錄"
    nav_ul = soup.new_tag('ul')

    # 基於現有的h2標籤創建導航
    for h2 in soup.find_all('h2'):
        li = soup.new_tag('li')
        a = soup.new_tag('a', href=f"#{h2.get('id', '')}")
        a.string = h2.get_text()
        li.append(a)
        nav_ul.append(li)

    nav.append(nav_title)
    nav.append(nav_ul)

    # 將導航插入到header之後
    header = soup.find('header')
    if header:
        header.insert_after(nav)
        print("創建並插入了目錄導航")

    print("\n9. 文檔結構優化:")

    # 添加語義化標籤
    main_tag = soup.find('main')
    if main_tag:
        # 爲main標籤添加role屬性
        main_tag['role'] = 'main'
        main_tag['aria-label'] = '主要內容'
        print("爲main標籤添加了無障礙屬性")

    # 添加meta標籤
    head = soup.find('head')
    if head:
        # 添加viewport meta
        viewport_meta = soup.new_tag('meta', attrs={
            'name': 'viewport',
            'content': 'width=device-width, initial-scale=1.0'
        })

        # 添加description meta
        desc_meta = soup.new_tag('meta', attrs={
            'name': 'description',
            'content': 'Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。'
        })

        head.append(viewport_meta)
        head.append(desc_meta)
        print("添加了viewport和description meta標籤")

    print("\n10. 輸出修改後的文檔:")

    # 格式化輸出
    formatted_html = soup.prettify()
    print("修改後的HTML文檔:")
    print(formatted_html[:1000] + "..." if len(formatted_html) > 1000 else formatted_html)

    # 統計信息
    print(f"\n文檔統計:")
    print(f"  總標籤數: {len(soup.find_all())}")
    print(f"  段落數: {len(soup.find_all('p'))}")
    print(f"  標題數: {len(soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']))}")
    print(f"  列表項數: {len(soup.find_all('li'))}")
    print(f"  代碼塊數: {len(soup.find_all('div', class_='code-block'))}")

    return soup

# 運行文檔修改演示
if __name__ == "__main__":
    modified_soup = document_modification_demo()

終端日誌:

=== 文檔修改功能演示 ===

1. 修改文本內容:
標題修改: 'Python學習筆記' -> 'Python高級編程技巧'
作者修改: '原作者' -> '技術專家'
時間修改: '發佈時間: 2024-01-01' -> '發佈時間: 2024-01-15 (已更新)'

2. 添加新元素:
添加了3個新的主題項目
當前主題列表: ['變量和數據類型', '控制結構', '函數和模塊', '面向對象編程', '異常處理']
添加了新的代碼示例
添加了新的學習資源section

3. 修改屬性:
容器類名修改: ['container'] -> ['main-container', 'updated']
添加了data-version屬性: 2.0
代碼塊1添加了語言標識和行號屬性
代碼塊2添加了語言標識和行號屬性

4. 刪除元素:
刪除了版權信息: '版權所有 © 2024'

5. 元素移動和重排:
將簡介section移動到主要內容section之後

6. 批量操作:
爲h2標籤添加id: section-主要內容
爲h2標籤添加id: section-簡介
爲h2標籤添加id: section-代碼示例
爲h2標籤添加id: section-學習資源
沒有找到鏈接元素

7. 條件修改:
爲1個包含'Python'的段落添加了樣式

8. 創建複雜結構:
創建並插入了目錄導航

9. 文檔結構優化:
爲main標籤添加了無障礙屬性
添加了viewport和description meta標籤

10. 輸出修改後的文檔:
修改後的HTML文檔:
<!DOCTYPE html>
<html>
 <head>
  <title>
   我的博客文章
  </title>
  <meta content="技術專家" name="author"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Python高級編程技巧學習筆記,包含函數、面向對象編程、異常處理等內容。" name="description"/>
 </head>
 <body>
  <div class="main-container updated" data-version="2.0">
   <header>
    <h1>
     Python高級編程技巧
    </h1>
    <p class="meta">
     發佈時間: 2024-01-15 (已更新)
    </p>
   </header>
   <nav class="table-of-contents">
    <h3>
     目錄
    </h3>
    <ul>
     <li>
      <a href="#section-主要內容">
       主要內容
      </a>
     </li>
     <li>
      <a href="#section-簡介">
       簡介
      </a>
     </li>
     <li>
      <a href="#section-代碼示例">
       代碼示例
      </a>
     </li>
     <li>
      <a href="#section-學習資源">
       學習資源
      </a>
     </li>
    </ul>
   </nav>
   <main aria-label="主要內容" class="content" role="main">
    <section class="topics">
     <h2 id="section-主要內容">
      主要內容
     </h2>
     <ul id="topic-list">
      <li>
       變量和數據類型
      </li>
      <li>
       控制結構
      </li>
      <li>
       函數和模塊
      </li>
      <li>
       面向對象編程
      </li>
      <li>
       異常處理
      </li>
     </ul>
    </section>
    <section class="intro">
     <h2 id="section-簡介">
      簡介
     </h2>
     <p class="python-related" style="font-weight: bold; color: #3776ab;">
      這是一篇關於Python基礎的文章。
     </p>
    </section>
    <section class="examples">
     <h2 id="section-代碼示例">
      代碼示例
     </h2>
     <div class="code-block" data-language="python" data-line-numbers="true">
      <pre><code>print("Hello, World!")</code></pre>
     </div>
     <div class="code-block" data-language="python" data-line-numbers="true">
      <pre><code>def greet(name):
    return f"Hello, {name}!"

print(greet("Python"))</code></pre>
     </div>
    </section>
    <section class="resources">
     <h2 id="section-學習資源">
      學習資源
     </h2>
     <ul>
      <li>
       Python官方文檔
      </li>
      <li>
       在線編程練習
      </li>
      <li>
       開源項目參與
      </li>
     </ul>
    </section>
   </main>
   <footer>
   </footer>
  </div>
 </body>
</html>...

文檔統計:
  總標籤數: 49
  段落數: 2
  標題數: 6
  列表項數: 12
  代碼塊數: 2
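
修改完成後通常要把結果寫回文件:str(soup)得到緊湊的HTML,soup.prettify()得到帶縮進的版本,直接以UTF-8寫入即可。下面是一個簡單的保存函數草稿(文件名僅爲示例):

def save_soup(soup, path, pretty=False):
    """示意:把修改後的文檔保存爲HTML文件"""
    html = soup.prettify() if pretty else str(soup)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

# 配合上面的演示使用
save_soup(modified_soup, 'modified_article.html')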

元素插入和刪除

BeautifulSoup提供了靈活的元素插入和刪除方法。
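
常用方法可以分成幾組:insert_before()/insert_after()在兄弟位置插入,insert()/append()向父元素內插入,extract()摘除節點(節點仍可複用),decompose()徹底銷毀,replace_with()原地替換。先看一個最小示例(示意代碼):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>A</li><li>B</li><li>C</li></ul>', 'html.parser')
ul = soup.find('ul')
items = ul.find_all('li')

new_li = soup.new_tag('li')
new_li.string = 'A+'
items[0].insert_after(new_li)  # 在A之後插入兄弟節點

items[2].decompose()           # 徹底刪除C,該節點不可再用

moved = items[1].extract()     # 摘除B,節點仍然可用
ul.insert(0, moved)            # 按下標插回到最前面

print([li.get_text() for li in ul.find_all('li')])
# 輸出: ['B', 'A', 'A+']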

def element_operations_demo():
    """
    演示元素插入和刪除操作
    """
    print("=== 元素插入和刪除操作演示 ===")

    # 示例HTML - 文章列表
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <title>文章管理系統</title>
    </head>
    <body>
        <div class="article-manager">
            <header class="page-header">
                <h1>文章列表</h1>
                <div class="actions">
                    <button class="btn-new">新建文章</button>
                </div>
            </header>

            <main class="article-list">
                <article class="article-item" data-id="1">
                    <h2 class="article-title">Python基礎教程</h2>
                    <p class="article-summary">學習Python編程的基礎知識</p>
                    <div class="article-meta">
                        <span class="author">作者: 張三</span>
                        <span class="date">2024-01-01</span>
                        <span class="category">編程</span>
                    </div>
                    <div class="article-actions">
                        <button class="btn-edit">編輯</button>
                        <button class="btn-delete">刪除</button>
                    </div>
                </article>

                <article class="article-item" data-id="2">
                    <h2 class="article-title">Web開發入門</h2>
                    <p class="article-summary">從零開始學習Web開發</p>
                    <div class="article-meta">
                        <span class="author">作者: 李四</span>
                        <span class="date">2024-01-05</span>
                        <span class="category">Web開發</span>
                    </div>
                    <div class="article-actions">
                        <button class="btn-edit">編輯</button>
                        <button class="btn-delete">刪除</button>
                    </div>
                </article>
            </main>

            <footer class="page-footer">
                <p>共 2 篇文章</p>
            </footer>
        </div>
    </body>
    </html>
    """

    soup = BeautifulSoup(html_content, 'html.parser')

    print("\n1. 在指定位置插入元素:")

    # 在第一篇文章前插入新文章
    article_list = soup.find('main', class_='article-list')
    first_article = soup.find('article', class_='article-item')

    if article_list and first_article:
        # 創建新文章
        new_article = soup.new_tag('article', attrs={'class': 'article-item featured', 'data-id': '0'})

        # 創建文章標題
        title = soup.new_tag('h2', attrs={'class': 'article-title'})
        title.string = "🔥 熱門推薦:Python高級特性詳解"

        # 創建文章摘要
        summary = soup.new_tag('p', attrs={'class': 'article-summary'})
        summary.string = "深入瞭解Python的高級特性和最佳實踐"

        # 創建元數據
        meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})

        author_span = soup.new_tag('span', attrs={'class': 'author'})
        author_span.string = "作者: 技術專家"

        date_span = soup.new_tag('span', attrs={'class': 'date'})
        date_span.string = "2024-01-15"

        category_span = soup.new_tag('span', attrs={'class': 'category featured-category'})
        category_span.string = "高級編程"

        meta_div.extend([author_span, date_span, category_span])

        # 創建操作按鈕
        actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})

        edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit'})
        edit_btn.string = "編輯"

        delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
        delete_btn.string = "刪除"

        pin_btn = soup.new_tag('button', attrs={'class': 'btn-pin'})
        pin_btn.string = "置頂"

        actions_div.extend([edit_btn, delete_btn, pin_btn])

        # 組裝新文章
        new_article.extend([title, summary, meta_div, actions_div])

        # 插入到第一篇文章前
        first_article.insert_before(new_article)

        print("在列表開頭插入了特色文章")

    # 在最後一篇文章後插入新文章
    all_articles = soup.find_all('article', class_='article-item')
    if all_articles:
        last_article = all_articles[-1]

        # 創建另一篇新文章
        another_article = soup.new_tag('article', attrs={'class': 'article-item draft', 'data-id': '3'})

        title = soup.new_tag('h2', attrs={'class': 'article-title'})
        title.string = "📝 草稿:數據庫設計原理"

        summary = soup.new_tag('p', attrs={'class': 'article-summary'})
        summary.string = "數據庫設計的基本原理和最佳實踐(草稿狀態)"

        meta_div = soup.new_tag('div', attrs={'class': 'article-meta'})

        author_span = soup.new_tag('span', attrs={'class': 'author'})
        author_span.string = "作者: 王五"

        date_span = soup.new_tag('span', attrs={'class': 'date'})
        date_span.string = "2024-01-16"

        status_span = soup.new_tag('span', attrs={'class': 'status draft-status'})
        status_span.string = "草稿"

        meta_div.extend([author_span, date_span, status_span])

        actions_div = soup.new_tag('div', attrs={'class': 'article-actions'})

        edit_btn = soup.new_tag('button', attrs={'class': 'btn-edit primary'})
        edit_btn.string = "繼續編輯"

        publish_btn = soup.new_tag('button', attrs={'class': 'btn-publish'})
        publish_btn.string = "發佈"

        delete_btn = soup.new_tag('button', attrs={'class': 'btn-delete'})
        delete_btn.string = "刪除"

        actions_div.extend([edit_btn, publish_btn, delete_btn])

        another_article.extend([title, summary, meta_div, actions_div])

        # 插入到最後一篇文章後
        last_article.insert_after(another_article)

        print("在列表末尾插入了草稿文章")

    print("\n2. 在父元素中插入子元素:")

    # 在頁面頭部添加搜索框
    page_header = soup.find('header', class_='page-header')
    if page_header:
        # 創建搜索區域
        search_div = soup.new_tag('div', attrs={'class': 'search-area'})

        search_input = soup.new_tag('input', attrs={'type': 'text', 'placeholder': '搜索文章...', 'class': 'search-input'})
        search_btn = soup.new_tag('button', attrs={'class': 'btn-search'})
        search_btn.string = "搜索"

        search_div.extend([search_input, search_btn])

        # 插入到actions div之前
        actions_div = page_header.find('div', class_='actions')
        if actions_div:
            actions_div.insert_before(search_div)
            print("在頁面頭部添加了搜索區域")

    # 在每篇文章中添加標籤
    articles = soup.find_all('article', class_='article-item')
    for i, article in enumerate(articles):
        meta_div = article.find('div', class_='article-meta')
        if meta_div:
            # 創建標籤容器
            tags_div = soup.new_tag('div', attrs={'class': 'article-tags'})

            # 根據文章類型添加不同標籤
            if 'featured' in article.get('class', []):
                tags = ['熱門', '推薦', 'Python']
            elif 'draft' in article.get('class', []):
                tags = ['草稿', '數據庫']
            else:
                tags = ['基礎', '教程']

            for tag in tags:
                tag_span = soup.new_tag('span', attrs={'class': 'tag'})
                tag_span.string = tag
                tags_div.append(tag_span)

            # 插入到meta div之後
            meta_div.insert_after(tags_div)

        print(f"爲文章{i+1}添加了標籤")

    print("\n3. 刪除元素:")

    # 刪除第二篇文章(原來的第一篇)
    articles = soup.find_all('article', class_='article-item')
    if len(articles) > 1:
        article_to_delete = articles[1]  # 第二篇文章
        article_title = article_to_delete.find('h2', class_='article-title')
        title_text = article_title.get_text() if article_title else "未知標題"

        article_to_delete.decompose()  # 完全刪除
        print(f"刪除了文章: '{title_text}'")

    # 刪除所有草稿狀態的文章
    draft_articles = soup.find_all('article', class_='draft')
    deleted_drafts = []

    for draft in draft_articles:
        title_elem = draft.find('h2', class_='article-title')
        if title_elem:
            deleted_drafts.append(title_elem.get_text())
        draft.decompose()

    if deleted_drafts:
        print(f"刪除了草稿文章: {deleted_drafts}")
    else:
        print("沒有找到草稿文章")

    # 刪除特定的按鈕
    pin_buttons = soup.find_all('button', class_='btn-pin')
    for btn in pin_buttons:
        btn.decompose()

    if pin_buttons:
        print(f"刪除了{len(pin_buttons)}個置頂按鈕")

    print("\n4. 替換元素:")

    # 替換頁面標題
    page_title = soup.find('h1')
    if page_title:
        old_title = page_title.get_text()

        # 創建新的標題元素
        new_title = soup.new_tag('h1', attrs={'class': 'main-title'})
        new_title.string = "📚 技術文章管理中心"

        # 替換
        page_title.replace_with(new_title)
        print(f"頁面標題替換: '{old_title}' -> '{new_title.get_text()}'")

    # 替換所有編輯按鈕爲更詳細的按鈕
    edit_buttons = soup.find_all('button', class_='btn-edit')
    for btn in edit_buttons:
        # 創建新的按鈕組
        btn_group = soup.new_tag('div', attrs={'class': 'btn-group'})

        quick_edit = soup.new_tag('button', attrs={'class': 'btn-quick-edit'})
        quick_edit.string = "快速編輯"

        full_edit = soup.new_tag('button', attrs={'class': 'btn-full-edit'})
        full_edit.string = "完整編輯"

        btn_group.extend([quick_edit, full_edit])

        # 替換原按鈕
        btn.replace_with(btn_group)

    print(f"替換了{len(edit_buttons)}個編輯按鈕爲按鈕組")

    print("\n5. 移動元素:")

    # 將搜索區域移動到標題之前
    search_area = soup.find('div', class_='search-area')
    main_title = soup.find('h1', class_='main-title')

    if search_area and main_title:
        # 提取搜索區域
        search_area.extract()
        # 插入到標題之前
        main_title.insert_before(search_area)
        print("將搜索區域移動到標題之前")

    # 重新排序文章(按日期)
    article_list = soup.find('main', class_='article-list')
    if article_list:
        articles = article_list.find_all('article', class_='article-item')

        # 提取所有文章
        article_data = []
        for article in articles:
            date_elem = article.find('span', class_='date')
            date_str = date_elem.get_text() if date_elem else "2024-01-01"
            article_data.append((date_str, article.extract()))

        # 按日期排序(最新的在前)
        article_data.sort(key=lambda x: x[0], reverse=True)

        # 重新插入排序後的文章
        for date_str, article in article_data:
            article_list.append(article)

        print(f"按日期重新排序了{len(article_data)}篇文章")

    print("\n6. 批量操作:")

    # 爲所有文章添加閱讀時間估算
    articles = soup.find_all('article', class_='article-item')
    for article in articles:
        summary = article.find('p', class_='article-summary')
        if summary:
            # 估算閱讀時間(基於摘要長度)
            text_length = len(summary.get_text())
            read_time = max(1, text_length // 50)  # 假設每50個字符需要1分鐘

            read_time_span = soup.new_tag('span', attrs={'class': 'read-time'})
            read_time_span.string = f"預計閱讀: {read_time}分鐘"

            # 插入到摘要之後
            summary.insert_after(read_time_span)

    print(f"爲{len(articles)}篇文章添加了閱讀時間估算")

    # 更新文章計數
    footer = soup.find('footer', class_='page-footer')
    if footer:
        count_p = footer.find('p')
        if count_p:
            current_count = len(soup.find_all('article', class_='article-item'))
            count_p.string = f"共 {current_count} 篇文章"
            print(f"更新了文章計數: {current_count}")

    print("\n7. 條件操作:")

    # 只對特色文章添加特殊標記
    featured_articles = soup.find_all('article', class_='featured')
    for article in featured_articles:
        title = article.find('h2', class_='article-title')
        if title and not title.get_text().startswith('🔥'):
            title.string = f"🔥 {title.get_text()}"

    print(f"爲{len(featured_articles)}篇特色文章添加了火焰標記")

    # 爲長摘要添加展開/收起功能
    summaries = soup.find_all('p', class_='article-summary')
    long_summaries = 0

    for summary in summaries:
        if len(summary.get_text()) > 30:  # 超過30個字符認爲是長摘要
            summary['class'] = summary.get('class', []) + ['long-summary']
            summary['data-full-text'] = summary.get_text()

            # 創建展開按鈕
            expand_btn = soup.new_tag('button', attrs={'class': 'btn-expand'})
            expand_btn.string = "展開"

            summary.insert_after(expand_btn)
            long_summaries += 1

    print(f"爲{long_summaries}個長摘要添加了展開功能")

    print("\n8. 最終文檔統計:")

    # 統計最終結果
    final_stats = {
        '總文章數': len(soup.find_all('article', class_='article-item')),
        '特色文章數': len(soup.find_all('article', class_='featured')),
        '草稿文章數': len(soup.find_all('article', class_='draft')),
        '總按鈕數': len(soup.find_all('button')),
        '標籤數': len(soup.find_all('span', class_='tag')),
        '總元素數': len(soup.find_all())
    }

    for key, value in final_stats.items():
        print(f"  {key}: {value}")

    # 輸出部分修改後的HTML
    print("\n9. 修改後的HTML片段:")
    article_list = soup.find('main', class_='article-list')
    if article_list:
        first_article = article_list.find('article')
        if first_article:
            print(first_article.prettify()[:500] + "...")

    return soup

# 運行元素操作演示
if __name__ == "__main__":
    modified_soup = element_operations_demo()
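
在爬蟲場景裏,插入與刪除最常見的用途是清洗頁面:在提取正文之前,先把script、style等噪聲節點整體刪掉。下面是一個簡單的清洗函數草稿:

from bs4 import BeautifulSoup

def clean_page(html):
    """示意:提取正文前清除常見噪聲節點"""
    soup = BeautifulSoup(html, 'html.parser')
    # soup(['script', 'style'])等價於soup.find_all(['script', 'style'])
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()  # 連同子樹一起徹底移除
    return soup.get_text(separator='\n', strip=True)

html = '<html><body><h1>標題</h1><script>var x=1;</script><p>正文</p></body></html>'
print(clean_page(html))  # 依次輸出"標題"和"正文"兩行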

編碼處理

BeautifulSoup能夠自動處理各種字符編碼問題。
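
這一能力主要由內部的UnicodeDammit組件提供:它會結合meta聲明和字節特徵猜測編碼(安裝charset-normalizer或chardet後猜測會更準確)。它也可以單獨使用,示意代碼如下:

from bs4 import UnicodeDammit

data = "中文內容".encode('gbk')                 # 模擬一段GBK字節流
dammit = UnicodeDammit(data, ['utf-8', 'gbk'])  # 第二個參數是候選編碼列表
print(dammit.original_encoding)  # 猜測出的編碼,此例中應爲gbk
print(dammit.unicode_markup)     # 解碼後的Unicode文本: 中文內容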

def encoding_demo():
    """
    演示編碼處理功能
    """
    print("=== 編碼處理功能演示 ===")

    # 1. 自動編碼檢測
    print("\n1. 自動編碼檢測:")

    # 不同編碼的HTML內容
    utf8_html = """
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>中文測試頁面</title>
    </head>
    <body>
        <h1>歡迎來到Python學習網站</h1>
        <p>這裏有豐富的Python教程和實例。</p>
        <div class="content">
            <h2>特殊字符測試</h2>
            <p>數學符號: α β γ δ ε ∑ ∏ ∫</p>
            <p>貨幣符號: ¥ $ € £ ₹</p>
            <p>表情符號: 😀 😃 😄 😁 🚀 🎉</p>
            <p>其他語言: こんにちは 안녕하세요 Здравствуйте</p>
        </div>
    </body>
    </html>
    """

    # 使用BeautifulSoup解析UTF-8內容
    soup_utf8 = BeautifulSoup(utf8_html, 'html.parser')
    print(f"UTF-8解析結果:")
    print(f"  標題: {soup_utf8.find('title').get_text()}")
    print(f"  主標題: {soup_utf8.find('h1').get_text()}")

    # 獲取原始編碼信息
    original_encoding = soup_utf8.original_encoding
    print(f"  檢測到的原始編碼: {original_encoding}")

    # 2. 處理不同編碼的內容
    print("\n2. 處理不同編碼的內容:")

    # 模擬GBK編碼的內容
    gbk_content = "<html><body><h1>中文標題</h1><p>這是GBK編碼的內容</p></body></html>"

    try:
        # 將字符串編碼爲GBK字節
        gbk_bytes = gbk_content.encode('gbk')
        print(f"GBK字節長度: {len(gbk_bytes)}")

        # 使用BeautifulSoup解析GBK字節
        soup_gbk = BeautifulSoup(gbk_bytes, 'html.parser', from_encoding='gbk')
        print(f"GBK解析結果:")
        print(f"  標題: {soup_gbk.find('h1').get_text()}")
        print(f"  段落: {soup_gbk.find('p').get_text()}")

    except UnicodeEncodeError as e:
        print(f"GBK編碼錯誤: {e}")

    # 3. 編碼轉換
    print("\n3. 編碼轉換:")

    # 獲取不同編碼格式的輸出
    html_str = str(soup_utf8)

    # UTF-8編碼
    utf8_bytes = html_str.encode('utf-8')
    print(f"UTF-8編碼字節數: {len(utf8_bytes)}")

    # 嘗試其他編碼
    encodings_to_test = ['utf-8', 'gbk', 'iso-8859-1', 'ascii']

    for encoding in encodings_to_test:
        try:
            encoded_bytes = html_str.encode(encoding)
            print(f"{encoding.upper()}編碼: 成功,{len(encoded_bytes)}字節")
        except UnicodeEncodeError as e:
            print(f"{encoding.upper()}編碼: 失敗 - {str(e)[:50]}...")

    # 4. 處理編碼錯誤
    print("\n4. 處理編碼錯誤:")

    # 創建包含特殊字符的內容
    special_html = """
    <html>
    <body>
        <h1>特殊字符處理測試</h1>
        <p>包含emoji: 🐍 Python編程</p>
        <p>數學公式: E = mc²</p>
        <p>版權符號: © 2024</p>
        <p>商標符號: Python™</p>
    </body>
    </html>
    """

    soup_special = BeautifulSoup(special_html, 'html.parser')

    # 不同的錯誤處理策略
    error_strategies = ['ignore', 'replace', 'xmlcharrefreplace']

    for strategy in error_strategies:
        try:
            # 嘗試編碼爲ASCII(會出錯)
            ascii_result = str(soup_special).encode('ascii', errors=strategy)
            decoded_result = ascii_result.decode('ascii')
            print(f"ASCII編碼策略'{strategy}': 成功")
            print(f"  結果長度: {len(decoded_result)}字符")

            # 顯示處理後的標題
            soup_result = BeautifulSoup(decoded_result, 'html.parser')
            title = soup_result.find('h1')
            if title:
                print(f"  處理後標題: {title.get_text()}")

        except Exception as e:
            print(f"ASCII編碼策略'{strategy}': 失敗 - {e}")

    # 5. 自定義編碼處理
    print("\n5. 自定義編碼處理:")

    def safe_encode_html(soup_obj, target_encoding='utf-8', fallback_encoding='ascii'):
        """
        安全地將BeautifulSoup對象編碼爲指定格式
        """
        html_str = str(soup_obj)

        try:
            # 嘗試目標編碼
            return html_str.encode(target_encoding)
        except UnicodeEncodeError:
            print(f"  {target_encoding}編碼失敗,嘗試{fallback_encoding}")
            try:
                # 使用替換策略的後備編碼
                return html_str.encode(fallback_encoding, errors='xmlcharrefreplace')
            except UnicodeEncodeError:
                print(f"  {fallback_encoding}編碼也失敗,使用忽略策略")
                return html_str.encode(fallback_encoding, errors='ignore')

    # 測試自定義編碼函數
    safe_bytes = safe_encode_html(soup_special, 'ascii')
    print(f"安全編碼結果: {len(safe_bytes)}字節")

    # 解碼並驗證
    safe_html = safe_bytes.decode('ascii')
    safe_soup = BeautifulSoup(safe_html, 'html.parser')
    safe_title = safe_soup.find('h1')
    if safe_title:
        print(f"安全編碼後標題: {safe_title.get_text()}")

    # 6. 編碼聲明處理
    print("\n6. 編碼聲明處理:")

    # 檢查和修改編碼聲明
    meta_charset = soup_utf8.find('meta', attrs={'charset': True})
    if meta_charset:
        original_charset = meta_charset.get('charset')
        print(f"原始字符集聲明: {original_charset}")

        # 修改字符集聲明
        meta_charset['charset'] = 'UTF-8'
        print(f"修改後字符集聲明: {meta_charset.get('charset')}")

    # 添加編碼聲明(如果不存在)
    head = soup_utf8.find('head')
    if head and not head.find('meta', attrs={'charset': True}):
        charset_meta = soup_utf8.new_tag('meta', charset='UTF-8')
        head.insert(0, charset_meta)
        print("添加了字符集聲明")

    # 7. 內容編碼驗證
    print("\n7. 內容編碼驗證:")

    def validate_encoding(html_content, expected_encoding='utf-8'):
        """
        驗證HTML內容的編碼
        """
        try:
            if isinstance(html_content, str):
                # 字符串內容,嘗試編碼
                html_content.encode(expected_encoding)
                return True, "字符串內容編碼有效"
            elif isinstance(html_content, bytes):
                # 字節內容,嘗試解碼
                html_content.decode(expected_encoding)
                return True, "字節內容編碼有效"
            else:
                return False, "未知內容類型"
        except UnicodeError as e:
            return False, f"編碼驗證失敗: {e}"

    # 驗證不同內容的編碼
    test_contents = [
        (utf8_html, 'utf-8'),
        (str(soup_utf8), 'utf-8'),
        (str(soup_special), 'utf-8')
    ]

    for content, encoding in test_contents:
        is_valid, message = validate_encoding(content, encoding)
        print(f"  {encoding}編碼驗證: {'✓' if is_valid else '✗'} {message}")

    # 8. 編碼統計信息
    print("\n8. 編碼統計信息:")

    def analyze_encoding(soup_obj):
        """
        分析BeautifulSoup對象的編碼信息
        """
        html_str = str(soup_obj)

        stats = {
            '總字符數': len(html_str),
            'ASCII字符數': sum(1 for c in html_str if ord(c) < 128),
            '非ASCII字符數': sum(1 for c in html_str if ord(c) >= 128),
            '中文字符數': sum(1 for c in html_str if '\u4e00' <= c <= '\u9fff'),
            '表情符號數': sum(1 for c in html_str if 0x1F000 <= ord(c) <= 0x1FAFF),  # 常見emoji所在的補充平面(近似範圍)
        }

        # 計算不同編碼的字節數
        for encoding in ['utf-8', 'utf-16', 'utf-32']:
            try:
                byte_count = len(html_str.encode(encoding))
                stats[f'{encoding.upper()}字節數'] = byte_count
            except UnicodeEncodeError:
                stats[f'{encoding.upper()}字節數'] = '編碼失敗'

        return stats

    # 分析特殊字符內容
    encoding_stats = analyze_encoding(soup_special)

    print("特殊字符內容編碼分析:")
    for key, value in encoding_stats.items():
        print(f"  {key}: {value}")

    # 9. 編碼最佳實踐建議
    print("\n9. 編碼最佳實踐建議:")

    recommendations = [
        "✓ 始終使用UTF-8編碼處理HTML內容",
        "✓ 在HTML頭部明確聲明字符集",
        "✓ 處理用戶輸入時驗證編碼",
        "✓ 使用適當的錯誤處理策略",
        "✓ 測試特殊字符和多語言內容",
        "✓ 避免混合使用不同編碼"
    ]

    for rec in recommendations:
        print(f"  {rec}")

    return soup_utf8, soup_special

# 運行編碼處理演示
if __name__ == "__main__":
    utf8_soup, special_soup = encoding_demo()

終端日誌:

=== 編碼處理功能演示 ===

1. 自動編碼檢測:
UTF-8解析結果:
  標題: 中文測試頁面
  主標題: 歡迎來到Python學習網站
  檢測到的原始編碼: None

2. 處理不同編碼的內容:
GBK字節長度: 59
GBK解析結果:
  標題: 中文標題
  段落: 這是GBK編碼的內容

3. 編碼轉換:
UTF-8編碼字節數: 674
UTF-8編碼: 成功,674字節
GBK編碼: 成功,638字節
ISO-8859-1編碼: 失敗 - 'latin-1' codec can't encode character '\u4e2d'...
ASCII編碼: 失敗 - 'ascii' codec can't encode character '\u4e2d' in...

4. 處理編碼錯誤:
ASCII編碼策略'ignore': 成功
  結果長度: 158字符
  處理後標題: 
ASCII編碼策略'replace': 成功
  結果長度: 398字符
  處理後標題: ????????
ASCII編碼策略'xmlcharrefreplace': 成功
  結果長度: 1058字符
  處理後標題: 特殊字符處理測試

5. 自定義編碼處理:
  ascii編碼失敗,嘗試ascii
安全編碼結果: 1058字節
安全編碼後標題: 特殊字符處理測試

6. 編碼聲明處理:
原始字符集聲明: UTF-8
修改後字符集聲明: UTF-8

7. 內容編碼驗證:
  utf-8編碼驗證: ✓ 字符串內容編碼有效
  utf-8編碼驗證: ✓ 字符串內容編碼有效
  utf-8編碼驗證: ✓ 字符串內容編碼有效

8. 編碼統計信息:
特殊字符內容編碼分析:
  總字符數: 254
  ASCII字符數: 158
  非ASCII字符數: 96
  中文字符數: 12
  表情符號數: 1
  UTF-8字節數: 302
  UTF-16字節數: 510
  UTF-32字節數: 1018

9. 編碼最佳實踐建議:
  ✓ 始終使用UTF-8編碼處理HTML內容
  ✓ 在HTML頭部明確聲明字符集
  ✓ 處理用戶輸入時驗證編碼
  ✓ 使用適當的錯誤處理策略
  ✓ 測試特殊字符和多語言內容
  ✓ 避免混合使用不同編碼
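
在實際爬取中,編碼問題往往出現在拿到響應的那一刻:當響應頭沒有聲明charset時,requests常把response.encoding回退爲ISO-8859-1,中文頁面就會亂碼。可以改用response.apparent_encoding(基於內容統計的猜測)校正,或者直接把response.content字節串交給BeautifulSoup自行探測。下面是一個綜合示例(URL沿用前文的示例站點):

import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """示意:獲取頁面並穩妥地處理編碼"""
    response = requests.get(url, timeout=10)

    # 響應頭未聲明charset時,糾正requests的默認猜測
    if response.encoding and response.encoding.lower() == 'iso-8859-1':
        response.encoding = response.apparent_encoding

    # 傳入字節串,讓BeautifulSoup結合meta聲明自行判斷編碼
    return BeautifulSoup(response.content, 'html.parser')

if __name__ == "__main__":
    soup = fetch_soup("https://yeyupiaoling.cn")
    print(soup.title.get_text() if soup.title else "無標題")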