The "1" in your message may mean you'd like a more concrete crawler example, or further explanation of a particular part. Below are several common directions — please pick whichever you need:

---

### **1. A Crawler with Anti-Scraping Handling (Cookies and Headers)**

```python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'

# Set request headers (mimic a browser visit)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
}

# Send the request with the headers attached
response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Parse the content...
else:
    print(f"Request failed, status code: {response.status_code}")
```

---

### **2. A Complete Crawler Using the `Scrapy` Framework**

Scrapy is a production-grade crawling framework, well suited to complex projects. The basic structure:

#### **Step 1: Create a Scrapy project**
```bash
scrapy startproject myspider
cd myspider
```

#### **Step 2: Write the spider**
Create `example_spider.py` under the `myspider/spiders` directory:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data
        for link in response.css('a::attr(href)'):
            yield {
                'url': link.get()
            }

        # Pagination (example: follow the next page)
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

#### **Step 3: Run the spider**
```bash
scrapy crawl example -o output.json
```

---

### **3. Handling Dynamic Pages (JavaScript Rendering)**

For sites that load content dynamically with JavaScript (such as single-page applications), you can use `Selenium`:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Start the browser. Selenium 4+ downloads a matching ChromeDriver
# automatically; on older versions, pass the driver path via a Service object.
driver = webdriver.Chrome()

url = 'https://example.com'
driver.get(url)
time.sleep(5)  # Crude wait for the page to load; WebDriverWait is more robust

# Extract data
links = driver.find_elements(By.TAG_NAME, 'a')
for link in links:
    print(link.get_attribute('href'))

driver.quit()
```

---

### **4. Storing Data in CSV / a Database**

#### **Save to CSV**
```python
import csv

data = [
    {'title': 'Article 1', 'url': 'https://example.com/1'},
    {'title': 'Article 2', 'url': 'https://example.com/2'}
]

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    fieldnames = ['title', 'url']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

#### **Store in MySQL**
```python
import mysql.connector

# Connect to the database
mydb = mysql.connector.connect(
    host="localhost",
    user="yourusername",
    password="yourpassword",
    database="mydatabase"
)
mycursor = mydb.cursor()

# Insert data (parameterized to avoid SQL injection)
sql = "INSERT INTO articles (title, url) VALUES (%s, %s)"
val = ("Article 1", "https://example.com/1")
mycursor.execute(sql, val)

mydb.commit()
mycursor.close()
mydb.close()
```

---

### **5. Pagination and Incremental Crawling**

```python
import requests

def get_page(url, page_num):
    params = {'page': page_num}
    response = requests.get(url, params=params)
    # Parse the current page's data here...
    return response

for page in range(1, 11):  # Fetch the first 10 pages
    get_page('https://example.com', page)
```

---

Tell me which kind of crawler example you need, or describe your requirements (e.g. the type of target site, the desired data format), and I'll provide more targeted code! 😊
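As a supplement to section 5: an incremental crawl usually also needs to skip URLs it has already collected, since page boundaries often shift between runs. A minimal, self-contained sketch — `fetch_page` here is a hypothetical stand-in for the real `requests.get` call, and its page contents are invented for illustration:

```python
def fetch_page(page_num):
    # Stand-in for a real HTTP request; returns the item URLs found on a page.
    # Note the deliberate overlap: 'b' appears on both pages.
    fake_site = {
        1: ['https://example.com/a', 'https://example.com/b'],
        2: ['https://example.com/b', 'https://example.com/c'],
    }
    return fake_site.get(page_num, [])

def crawl(pages, seen=None):
    """Yield only URLs that have not been seen before."""
    seen = set() if seen is None else seen
    for page in pages:
        for url in fetch_page(page):
            if url not in seen:
                seen.add(url)
                yield url

new_urls = list(crawl(range(1, 3)))
print(new_urls)  # ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
```

Persisting `seen` between runs (e.g. to a file or database) is what makes the crawl truly incremental.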
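If a full MySQL server is more than you need, Python's standard-library `sqlite3` offers the same pattern with zero setup. A sketch, assuming the same `articles` table shape as the MySQL example in section 4 (`':memory:'` keeps the demo self-contained; a filename such as `'articles.db'` would persist to disk):

```python
import sqlite3

# Hypothetical scraped records, same shape as the CSV example
data = [
    {'title': 'Article 1', 'url': 'https://example.com/1'},
    {'title': 'Article 2', 'url': 'https://example.com/2'},
]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT UNIQUE)')

# INSERT OR IGNORE skips rows whose url already exists,
# so re-running the crawler doesn't create duplicates
cur.executemany(
    'INSERT OR IGNORE INTO articles (title, url) VALUES (:title, :url)', data
)
conn.commit()

count = cur.execute('SELECT COUNT(*) FROM articles').fetchone()[0]
print(count)  # 2
```

The `UNIQUE` constraint plus `INSERT OR IGNORE` is a simple way to get idempotent storage without any extra bookkeeping code.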
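One note that applies to every approach above: it's good practice to check a site's robots.txt before crawling it. The standard library's `urllib.robotparser` handles this; the sketch below parses a rules string directly so it runs offline — in a real crawler you'd call `set_url()` and `read()` to fetch the live file:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules; in practice fetch them with
# rp.set_url('https://example.com/robots.txt'); rp.read()
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

allowed = rp.can_fetch('*', 'https://example.com/public/page')
blocked = rp.can_fetch('*', 'https://example.com/private/page')
print(allowed)  # True
print(blocked)  # False
```

Calling `rp.can_fetch(user_agent, url)` before each request keeps the crawler within the site's stated rules.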