AI Stock Trading: Using DeepSeek to Batch-Download Annual Reports from a Company's Official Website

Spirent Communications posts its financial reports for past years on its corporate website, but downloading them by hand is tedious.


AI can automate this.

First, look at the URL of each year's results page:

https://corporate.spirent.com/investors/results/2025

https://corporate.spirent.com/investors/results/2024

https://corporate.spirent.com/investors/results/2023

https://corporate.spirent.com/investors/results/2016
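Only the year at the end of the address changes, so the list of pages can be generated from a template. A minimal sketch, assuming the same pattern holds for every year in between; the names BASE_URL and results_urls are illustrative and do not come from the original script:

```python
# Template for the results page; only the trailing year varies.
BASE_URL = "https://corporate.spirent.com/investors/results/{year}"


def results_urls(start_year, end_year):
    """Return the results-page URL for each year, both ends inclusive."""
    return [BASE_URL.format(year=year) for year in range(start_year, end_year + 1)]


if __name__ == "__main__":
    for url in results_urls(2016, 2024):
        print(url)
```

Building the list up front makes it easy to check the addresses by eye before any browser automation starts.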

Then enter this prompt in DeepSeek:

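You are a Python programming expert. Write a Python script that batch-downloads the PDF files from a web page. The steps are:

Open the page: https://corporate.spirent.com/investors/results/{year} (the {year} parameter starts at 2016, ends at 2024, and increases by one each time);

In the folder "F:\思博伦Spirent Communications", create a subfolder named {year};

Locate the PDF and MP3 files on the page;

Open each file in turn;

In the newly opened window, locate the element; this is the download button;

Click the download button and save the file to that subfolder;

Note: print progress information to the screen at every step;

Include common anti-scraping countermeasures;

Use Playwright for all of the page requests above;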


Error: the download button was not found.


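Improvements in the revised script:

Direct download: files are fetched directly with the requests library, with no need to click a download button.

URL completion: if a file link is a relative path, the script completes it into a full URL.

File name extraction: the file name is taken from the file link, and the file is saved in the folder for that year.

Error handling: messages are printed when a download fails or raises an error.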
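URL completion and file name extraction can both be done with the standard library. The following is a sketch under stated assumptions, not the code DeepSeek produced: the helper names resolve_link and filename_from_url are hypothetical, and the example paths in the usage note are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlparse


def resolve_link(page_url, href):
    """Turn a relative, root-relative, or protocol-relative href into an absolute URL."""
    # urljoin leaves href untouched when it is already absolute.
    return urljoin(page_url, href)


def filename_from_url(file_url):
    """Use the last path segment as the file name, ignoring any query string."""
    path = urlparse(file_url).path
    # Decode percent-escapes such as %20 so the saved name is readable.
    return unquote(posixpath.basename(path))
```

For example, resolve_link("https://corporate.spirent.com/investors/results/2016", "/media/report.pdf") returns "https://corporate.spirent.com/media/report.pdf", and filename_from_url applied to that result returns "report.pdf".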

Running the program produced another error:

  File "C:\Users\dell\AppData\Roaming\Python\Python39\site-packages\playwright\_impl\_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "C:\Users\dell\AppData\Roaming\Python\Python39\site-packages\playwright\_impl\_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.goto: Timeout 30000ms exceeded.
Call log:
navigating to "https://corporate.spirent.com/investors/results/2016", waiting until "load"


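The error message shows that Playwright timed out while loading the page. Possible causes:

Network problems: the page loads slowly, so the timeout expires.

Dynamically loaded content: the page may need to run JavaScript or load dynamic content, so the load event does not fire in time.

Anti-scraping mechanisms: the site may detect the automation and block the request.

Solutions:

Increase the timeout: extend the timeout of the goto method.

Wait for a specific element: use the wait_for_selector method to wait for a particular element rather than for the whole page to finish loading.

Simulate user behavior: add random delays and user-like actions to lower the risk of being detected by anti-scraping mechanisms.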
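The script that follows derives its delay from the system clock. As an alternative, here is a minimal sketch of the random-delay idea built on the standard random module. The default bounds and the injectable sleep parameter are assumptions made for illustration; they are not part of the original script.

```python
import random
import time


def random_delay(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random interval and return the number of seconds chosen."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # replaceable in tests so nothing actually blocks
    return delay
```

Calling random_delay() between downloads spaces the requests unevenly. Whether that is enough to avoid a given site's detection is not something the sketch can guarantee.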

The improved script:

import os
import time
from urllib.parse import urljoin

import requests
from playwright.sync_api import sync_playwright

# Root directory for the downloads
base_dir = r"F:\思博伦Spirent Communications"

# Year range (both ends inclusive)
start_year = 2016
end_year = 2024


# Anti-scraping measure: pause 2 to 4 seconds, varying with the clock
def random_delay():
    time.sleep(2 + 1 * (time.time() % 2))


def download_file(url, save_path):
    try:
        # Stream the response so large files are not held in memory
        response = requests.get(url, stream=True, timeout=60)
        if response.status_code == 200:
            with open(save_path, 'wb') as file:
                for chunk in response.iter_content(chunk_size=8192):
                    file.write(chunk)
            print(f"File downloaded: {save_path}")
        else:
            print(f"Download failed: {url} (status code: {response.status_code})")
    except Exception as e:
        print(f"Download error: {url} ({str(e)})")


def download_files(year):
    # Create the subfolder for this year
    year_dir = os.path.join(base_dir, str(year))
    if not os.path.exists(year_dir):
        os.makedirs(year_dir)
        print(f"Created folder: {year_dir}")
    else:
        print(f"Folder already exists: {year_dir}")

    # Open the page
    url = f"https://corporate.spirent.com/investors/results/{year}"
    print(f"Opening page: {url}")

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Raise the timeout and wait for the page to load
        try:
            page.goto(url, timeout=60000)  # timeout raised to 60 seconds
            print("Page loaded")
        except Exception as e:
            print(f"Page failed to load: {str(e)}")
            browser.close()
            return

        # Wait for a specific element (here, the file links) to appear
        try:
            page.wait_for_selector('a[href$=".pdf"], a[href$=".mp3"]', timeout=30000)
            print("File links loaded")
        except Exception as e:
            print(f"Waiting for file links failed: {str(e)}")
            browser.close()
            return

        # Locate all links to PDF and MP3 files
        file_links = page.query_selector_all('a[href$=".pdf"], a[href$=".mp3"]')
        print(f"Found {len(file_links)} file links")

        for link in file_links:
            # Get the file link; urljoin completes relative and
            # protocol-relative links and leaves absolute ones unchanged
            file_url = urljoin(url, link.get_attribute('href'))
            print(f"Processing file: {file_url}")

            # Extract the file name
            file_name = os.path.basename(file_url)
            save_path = os.path.join(year_dir, file_name)

            # Download the file
            download_file(file_url, save_path)

            # Random delay
            random_delay()

        browser.close()


# Loop over the years and download the files
for year in range(start_year, end_year + 1):
    print(f"Starting year: {year}")
    download_files(year)
    print(f"Finished year: {year}\n")

print("All files downloaded!")

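Improvements:

Longer timeout: the timeout of the goto method is raised to 60 seconds.

Waiting for a specific element: the wait_for_selector method waits until the file links have loaded.

Error handling: messages are printed when the page fails to load or the wait for the element fails.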
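One limitation of the selector a[href$=".pdf"] is that it only matches links whose href ends in exactly that lowercase suffix, so a link such as report.PDF or report.pdf?v=2 would be skipped. A possible workaround, sketched here with the hypothetical helper is_wanted_file, is to collect every href and filter in Python; whether the Spirent pages actually contain such links has not been checked.

```python
from urllib.parse import urlparse

WANTED_EXTENSIONS = (".pdf", ".mp3")


def is_wanted_file(href, extensions=WANTED_EXTENSIONS):
    """True if the URL path ends with a wanted extension, ignoring case and query strings."""
    if not href:
        return False
    # urlparse separates the path from any ?query or #fragment part.
    return urlparse(href).path.lower().endswith(extensions)
```

With Playwright, the hrefs could be gathered by calling get_attribute('href') on each element returned by page.query_selector_all('a[href]') and then passed through is_wanted_file.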

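Running the script:

Make sure the requests and playwright libraries are installed, for example with pip install requests playwright, and then run playwright install chromium to download the browser that Playwright drives.

When run, the script downloads the PDF and MP3 files for the specified years and saves them in the matching subfolders.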

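Notes:

If the site uses more sophisticated anti-scraping mechanisms (such as CAPTCHAs or dynamically loaded content), the script may need further adjustment.

Keep the network connection stable so that downloads are not interrupted.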
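If a connection does drop mid-download, the script as written leaves a truncated file under the final name. One way to guard against that, which is an addition rather than part of DeepSeek's answer, is to write to a temporary name and rename only once every chunk has arrived. The function name save_chunks_atomically and the .part suffix are assumptions for this sketch.

```python
import os


def save_chunks_atomically(chunks, save_path):
    """Write chunks to save_path + '.part', then rename once everything has arrived."""
    temp_path = save_path + ".part"
    try:
        with open(temp_path, "wb") as file:
            for chunk in chunks:
                file.write(chunk)
        # os.replace overwrites any existing file at save_path.
        os.replace(temp_path, save_path)
    except BaseException:
        # Remove the incomplete file so a later run starts clean.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Inside download_file, the write loop could then become save_chunks_atomically(response.iter_content(chunk_size=8192), save_path), so a file that exists under its final name is always complete.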

I hope this improved script solves your problem! If you need anything else, just let me know.
