Automatic Paper Summarization: Structured Abstractive Summarization of Long Scientific Documents

FreeGuideOnline · Latest · 2026-06-26

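1. Input: raw PDF or full text
2. Text extraction and preprocessing
3. Discourse structure analysis (identifying IMRaD and similar sections)
4. Independent summary generation for each section (by calling a generative model)
5. Structured assembly and post-processing
6. Output: final structured summary in JSON or HTML format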
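As a rough illustration of the structure-analysis stage, the sketch below splits plain text into IMRaD-style sections by matching heading lines. The `split_sections` name, the `HEADING_MAP` keyword list, and the regular expression are assumptions made for this example; real papers usually need a dedicated parser, and any text before the first recognized heading is simply dropped here.

```python
import re

# Hypothetical mapping from heading keywords to canonical section names.
HEADING_MAP = {
    "abstract": "abstract",
    "introduction": "background",
    "background": "background",
    "method": "method",
    "methods": "method",
    "materials and methods": "method",
    "results": "result",
    "discussion": "discussion",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
}

# A candidate heading is a line made only of letters and spaces,
# optionally preceded by a number such as "2." or "2.1".
HEADING_RE = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Za-z ]+?)\s*$")


def split_sections(text):
    """Return {canonical_section: text} for headings found in HEADING_MAP."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        key = match.group(1).lower() if match else None
        if key in HEADING_MAP:
            # Start (or continue) collecting lines for this section.
            current = HEADING_MAP[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {name: " ".join(lines) for name, lines in sections.items()}
```

The returned dictionary keeps sections in the order in which their headings first appear, so it can be passed directly to a per-section summarizer.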


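#### 3.2 Key Models: Long-Text Summarization Models
Scientific papers routinely run to thousands of words, while conventional sequence-to-sequence (seq2seq) models are limited by their context length, often 512 to 1,024 tokens. Current mainstream options:
- **Longformer-Encoder-Decoder (LED)**: built on Longformer's sparse attention; handles inputs of up to 16K tokens.
- **BigBird**: also uses sparse attention and suits long documents (up to 4,096 tokens in the published models).
- **PEGASUS-X**: extends PEGASUS to support long inputs (up to 16K tokens).
- **Large language models**: GPT-4, Claude, and similar models accept very long contexts and can generate a structured summary directly from a prompt.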

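**Recommended starter model:** `allenai/led-base-16384` (available from Hugging Face), which accepts inputs of up to 16K tokens. This base checkpoint is initialized from BART-base and is not itself fine-tuned on scientific text; for papers, fine-tune it or start from a variant such as `allenai/led-large-16384-arxiv`, which was fine-tuned on the arXiv summarization dataset.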
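To make the benefit of sparse attention concrete, the following back-of-the-envelope sketch compares the number of query-key pairs scored by dense self-attention with a sliding-window pattern. The window size of 1024 and the single global token are illustrative assumptions, not values read from any specific checkpoint, and the function names are made up for this example.

```python
def full_attention_pairs(n_tokens):
    """Pairs scored by dense self-attention: every token attends to every token."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens, window, n_global=0):
    """Rough pair count for sliding-window attention plus a few global tokens.

    Each token attends to about `window` neighbours; each global token also
    attends to, and is attended by, all tokens. Boundary effects and double
    counting are ignored, so this is only an estimate.
    """
    local_pairs = n_tokens * window
    global_pairs = 2 * n_global * n_tokens
    return local_pairs + global_pairs


n = 16_384  # input length in tokens
dense = full_attention_pairs(n)
sparse = windowed_attention_pairs(n, window=1024, n_global=1)
print(f"dense:  {dense:,} pairs")
print(f"sparse: {sparse:,} pairs")
print(f"ratio:  {dense / sparse:.1f}x")
```

Dense cost grows quadratically with input length, while the windowed estimate grows linearly, which is why these models can afford 16K-token inputs.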

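#### 3.3 Data Preparation and Fine-Tuning
Recommended public datasets:
- **PubMed**: biomedical papers with abstracts and full text, drawn from a corpus of millions of articles.
- **arXiv**: preprints in computer science, physics, and other fields, each with an author-written abstract.
- **SciTLDR**: extremely short, single-sentence summaries of scientific papers, useful for training highly condensed output.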
- **PubMedQA**: structured question-answer pairs built from biomedical abstracts, suited to field-by-field generation.

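**Fine-tuning strategies:**
1. Split the full text into fixed-length chunks and pair them with the target structured summary.
2. Train separate summarization models for different sections (for example, Methods and Results).
3. Use a pointer-generator network to copy key terms (such as chemical formulas and numbers) verbatim from the source, reducing transcription errors.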
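The first strategy, splitting the full text into fixed-length chunks, can be sketched as follows. Whitespace-separated words stand in for model tokens, the `chunk_size` and `overlap` defaults are illustrative assumptions, and pairing every chunk with the same target summary is a simplification; in practice one would count tokens with the model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into chunks of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens so that sentences cut at a
    boundary still appear whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks


def make_training_pairs(full_text, target_summary, chunk_size=512, overlap=64):
    """Pair every chunk of the source text with the same target summary."""
    tokens = full_text.split()
    return [
        {"source": " ".join(chunk), "target": target_summary}
        for chunk in chunk_tokens(tokens, chunk_size, overlap)
    ]
```

Each resulting dictionary is one training example; the `source` and `target` keys are hypothetical names and should be adapted to whatever the training script expects.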

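### 4. Hands-On: Building a Simple Structured Summary Generator
This tutorial uses Python and Hugging Face `transformers` with a PyTorch backend. Make sure the following packages are installed:
```bash
pip install torch transformers datasets nltk evaluate
```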

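**Step 1: Load a pretrained long-text summarization model**

```python
from transformers import LEDForConditionalGeneration, LEDTokenizer
import torch

# Download the tokenizer and model weights from the Hugging Face Hub.
model_name = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)
```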

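**Step 2: Prepare the input text and set up a structured prompt**

Suppose we have extracted the Methods section of a paper and want a summary for that field. Build a prompt template as shown below. Because LED is an encoder-decoder model and the base checkpoint is not instruction-tuned, the prompt acts only as a plain-text prefix; it becomes useful once the model has been fine-tuned on inputs with the same prefix.

```python
section_text = "We trained a BERT-base model on the SQuAD dataset with learning rate 3e-5 for 3 epochs..."
prompt = f"Summarize the METHOD section of a scientific paper:\n\n{section_text}\n\nSummary:"

# Tokenize and truncate to 4,096 tokens (the model itself accepts up to 16,384).
inputs = tokenizer(prompt, return_tensors="pt", max_length=4096, truncation=True)
```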

**Step 3: Generate the summary**

```python
# LED uses local attention by default; give the first token (<s>) global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # marks real tokens vs. padding
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
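Steps 2 and 3 can be wrapped into one reusable function so that the same logic runs for every section. The sketch below is one possible shape for such a helper: the names `summarize_section` and `bind_summarizer`, the signatures, and the `max_input_tokens` and `max_summary_tokens` defaults are assumptions for illustration, and `model` and `tokenizer` are passed in explicitly rather than read from globals.

```python
from functools import partial

import torch


def summarize_section(model, tokenizer, section_text, section_name,
                      max_input_tokens=4096, max_summary_tokens=256):
    """Summarize one section with an LED-style model (hypothetical helper)."""
    prompt = (
        f"Summarize the {section_name} section of a scientific paper:\n\n"
        f"{section_text}\n\nSummary:"
    )
    inputs = tokenizer(
        prompt, return_tensors="pt", max_length=max_input_tokens, truncation=True
    )
    # Global attention on the first token only; all other tokens stay local.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        global_attention_mask=global_attention_mask,
        max_length=max_summary_tokens,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


def bind_summarizer(model, tokenizer):
    """Return a two-argument callable: (section_text, section_name) -> summary."""
    return partial(summarize_section, model, tokenizer)
```

With the `model` and `tokenizer` loaded in Step 1, `generate_section_summary = bind_summarizer(model, tokenizer)` yields a function that can be called as `generate_section_summary(method_text, "METHOD")`.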

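**Step 4: Batch processing and structured output**

First split the full paper into sections with a tool such as GROBID or spaCy, then call a helper that wraps Steps 2 and 3 for each section, and finally combine the results into a dictionary:

```python
# generate_section_summary(text, section_name) is assumed to wrap Steps 2 and 3
# and return the decoded summary string; the *_text variables come from section splitting.
structured_summary = {
    "background": generate_section_summary(background_text, "BACKGROUND"),
    "method": generate_section_summary(method_text, "METHOD"),
    "result": generate_section_summary(result_text, "RESULT"),
    "conclusion": generate_section_summary(conclusion_text, "CONCLUSION"),
}
```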
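The final assembly and post-processing stage, which turns per-section summaries into JSON, might look like the sketch below. The `assemble_summary` name, the `FIELDS` mapping, and the whitespace clean-up are assumptions for illustration; `summarize` is any callable that takes `(text, section_name)`, so a stub stands in for a real model here.

```python
import json

# Output field -> label passed to the summarizer (same keys as the dictionary above).
FIELDS = {
    "background": "BACKGROUND",
    "method": "METHOD",
    "result": "RESULT",
    "conclusion": "CONCLUSION",
}


def assemble_summary(sections, summarize):
    """Summarize each available section and return the result as a JSON string."""
    structured = {}
    for field, label in FIELDS.items():
        text = sections.get(field, "").strip()
        if not text:
            structured[field] = None  # section missing from the paper
            continue
        summary = summarize(text, label)
        structured[field] = " ".join(summary.split())  # collapse stray whitespace
    return json.dumps(structured, ensure_ascii=False, indent=2)


def stub_summarize(text, label):
    """Stand-in for a real model: returns the first sentence of the section."""
    return text.split(". ")[0].rstrip(".") + "."


print(assemble_summary(
    {
        "method": "We fine-tune LED on arXiv.  Training takes 3 epochs.",
        "result": "ROUGE-L improves.",
    },
    stub_summarize,
))
```

Recording missing sections as `null` instead of omitting them keeps the output schema stable, which simplifies any downstream HTML rendering.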