Goodbye to Manual Downloads! Batch-Fetching Compound 3D Structures with Python's PubChemPy Library (Complete Code and Pitfall Checklist Included)

张建站

2026/4/15 22:16:41

10-minute read

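An Engineering Approach to Batch-Downloading Compound 3D Structures with Python

In drug discovery and materials science, researchers often need 3D structure data for large numbers of compounds. Downloading the files by hand is slow and error-prone. This article shows how to build a robust batch downloader with Python's PubChemPy library.

1. Environment Setup and Basic Configuration

1.1 Installing PubChemPy

Installing with conda is recommended, since it resolves dependencies automatically:

    conda install -c conda-forge pubchempy

Verify that the installation succeeded:

    import pubchempy as pcp

    print(pcp.__version__)

1.2 Understanding the PubChem Database

PubChem is one of the world's largest chemical information databases, with records for over 100 million compounds. Its 3D conformer data is essential for computational chemistry applications such as molecular docking and structure-activity relationship studies.

Note: not every compound in PubChem has 3D conformer data, particularly newly discovered compounds and molecules with certain unusual structures.

2. Core Download Functionality

2.1 Basic Code for Downloading a Single Compound

The most basic 3D structure download looks like this:

    import pubchempy as pcp

    def download_single_compound(cid, save_path):
        """Download the 3D SDF for one CID; return True on success."""
        try:
            pcp.download('SDF', save_path, overwrite=True,
                         identifier=cid, record_type='3d')
            print(f"Downloaded 3D structure for CID {cid}")
            return True
        except pcp.NotFoundError:
            print(f"CID {cid} has no 3D conformer available")
            return False
        except Exception as e:
            print(f"Error while downloading CID {cid}: {e}")
            return False

2.2 An Engineered Batch Download

In practice we usually need to process hundreds or thousands of compounds. The following batch version builds on download_single_compound from section 2.1:

    import os
    import time
    from tqdm import tqdm  # progress bar library

    def batch_download(cid_list, output_dir, delay=0.5):
        """
        Batch-download compound 3D structures.

        Parameters:
            cid_list: list of compound CIDs
            output_dir: directory to save files in
            delay: pause between requests (seconds), to avoid server limits
        """
        os.makedirs(output_dir, exist_ok=True)

        success_count = 0
        for cid in tqdm(cid_list):
            save_path = os.path.join(output_dir, f"{cid}.sdf")
            if download_single_compound(cid, save_path):
                success_count += 1
            time.sleep(delay)  # polite delay between requests

        print(f"\nDone: downloaded {success_count}/{len(cid_list)} compounds")

3. Advanced Features and Performance Optimization

3.1 Speeding Up Downloads with Multithreading

For large numbers of compounds, a thread pool can improve throughput:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def threaded_download(cid_list, output_dir, max_workers=4):
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # executor.map preserves input order and yields True/False per CID
            results = list(executor.map(
                lambda cid: download_single_compound(
                    cid, os.path.join(output_dir, f"{cid}.sdf")
                ),
                cid_list
            ))
        print(f"Downloaded {sum(results)}/{len(cid_list)} compounds")

3.2 Result Validation and Quality Control

After the download finishes, a basic validation pass is recommended:

    import os

    def validate_sdf_files(output_dir):
        """Check whether the downloaded SDF files look valid."""
        invalid_files = []
        for filename in os.listdir(output_dir):
            if filename.endswith('.sdf'):
                filepath = os.path.join(output_dir, filename)
                # Crude size check: files under 1 KB are treated as suspect
                if os.path.getsize(filepath) < 1024:
                    invalid_files.append(filename)

        if invalid_files:
            print(f"Found {len(invalid_files)} possibly invalid files")
            return False
        return True

4. Case Study and Troubleshooting

4.1 Typical Application Scenario

Case: preparing a dataset for antiviral drug screening

Suppose we need the 3D structures of 50 known antiviral compounds for virtual screening:

    # Example CID list
    antiviral_cids = [
        121304016, 445639, 445643, 445638,
        445640, 445642, 445641, 445644,
        # ...more CIDs
    ]

    batch_download(antiviral_cids, "./antiviral_compounds")

4.2 Common Problems and Solutions

| Problem type | Symptom | Solution |
|---|---|---|
| Network timeout | Download is interrupted | Add a retry mechanism; set a reasonable delay |
| Invalid CID | NotFoundError is raised | Validate CIDs beforehand |
| Server rate limiting | HTTP 429 error | Lower the request frequency; use proxy rotation |
| Insufficient disk space | Write fails | Check free space in the target directory |
| Permission problem | File cannot be created | Make sure you have write permission |

4.3 Complete Engineered Script

The following script combines the optimizations above:

    import os
    import time
    import pubchempy as pcp
    from tqdm import tqdm
    from concurrent.futures import ThreadPoolExecutor, as_completed

    class PubChem3DDownloader:
        def __init__(self, output_dir='output'):
            self.output_dir = output_dir
            os.makedirs(self.output_dir, exist_ok=True)

        def download_single(self, cid, max_retries=3):
            save_path = os.path.join(self.output_dir, f"{cid}.sdf")
            for attempt in range(max_retries):
                try:
                    pcp.download('SDF', save_path, overwrite=True,
                                 identifier=cid, record_type='3d')
                    return True
                except pcp.NotFoundError:
                    # No 3D conformer exists, so retrying will not help
                    return False
                except Exception as e:
                    if attempt == max_retries - 1:
                        print(f"CID {cid} download failed: {e}")
                        return False
                    time.sleep(2 ** attempt)  # exponential backoff
            return False

        def batch_download(self, cid_list, max_workers=4):
            with ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = {
                    executor.submit(self.download_single, cid): cid
                    for cid in cid_list
                }
                success = 0
                for future in tqdm(as_completed(futures), total=len(cid_list)):
                    if future.result():
                        success += 1
                    # Brief pause between result checks. This does not slow
                    # the worker threads; lower max_workers if PubChem
                    # starts returning HTTP 429.
                    time.sleep(0.3)
            print(f"Done, success rate: {success}/{len(cid_list)}")
            return success

In real projects, this script helped our team cut compound data preparation time from several hours to a few minutes, while noticeably reducing human error. Automation is even more valuable in research that requires the dataset to be refreshed regularly.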
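For the regularly refreshed datasets mentioned above, one possible refinement is to skip CIDs whose files are already on disk and look complete, so that a rerun only fetches what is missing. The sketch below is an illustration under stated assumptions, not part of PubChemPy: the helper names `looks_like_sdf` and `pending_cids` are hypothetical, files are assumed to follow the `{cid}.sdf` naming used in this article, and the completeness check relies only on the standard SDF markers (`M  END` closing the molfile block and `$$$$` closing the record).

```python
import os


def looks_like_sdf(filepath):
    """Return True if the file carries the basic markers of a complete SDF record.

    Hypothetical helper: this is a structural sanity check, not a chemistry check.
    """
    try:
        with open(filepath, "r", encoding="utf-8", errors="replace") as handle:
            text = handle.read()
    except OSError:
        return False
    # A molfile block is closed by "M  END" and an SDF record by "$$$$",
    # so a truncated or empty download will normally fail this test.
    return "M  END" in text and text.rstrip().endswith("$$$$")


def pending_cids(cid_list, output_dir):
    """Return the CIDs that still need downloading, in their original order.

    Hypothetical helper: assumes files are saved as <output_dir>/<cid>.sdf.
    """
    pending = []
    for cid in cid_list:
        filepath = os.path.join(output_dir, f"{cid}.sdf")
        if not (os.path.isfile(filepath) and looks_like_sdf(filepath)):
            pending.append(cid)
    return pending
```

A rerun could then pass `pending_cids(all_cids, "output")` to the batch download instead of the full list. Note that CIDs with no 3D conformer never produce a file, so they would be requested again on every run unless they are recorded separately.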
