Transformers in Practice: Text Classification, Generation, and Question Answering

FreeGuideOnline · Latest · 2026-06-12

Hugging Face Transformers: A Hands-On NLP Tutorial

Getting Started with Text Classification, Text Generation, and Question Answering Systems

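Hugging Face Transformers has become the de facto standard library for natural language processing (NLP). It provides thousands of pretrained models behind a unified interface, so developers can handle tasks such as text classification, translation, summarization, and question answering in a few lines of code. This tutorial starts from scratch, covers the library's core usage, and works through three hands-on tasks (text classification, text generation, and reading-comprehension question answering) to get you productive with modern NLP quickly.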

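1. Environment Setup and Installation

First make sure you are running Python 3.8 or later (recent Transformers releases may require a newer minimum version, so check the official installation guide). Installing inside a virtual environment is recommended: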

pip install transformers torch

If you have an NVIDIA GPU with CUDA installed, you can use a GPU build of torch; otherwise the CPU build runs most of the examples, only more slowly.

Verify the installation:

import transformers
print(transformers.__version__)

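2. Core Concepts at a Glance: Pipeline, Model, and Tokenizer

The Transformers library is built around three key abstractions:

Pipeline: the highest-level interface; it wraps tokenization, model inference, and post-processing in a single call.
Tokenizer: converts raw text into the input IDs (numeric tensors) a model accepts, adding special tokens such as [CLS] and [SEP] along the way.
Model: a PyTorch or TensorFlow model that takes input IDs and produces predictions (logits or generated tokens).

For beginners, the Pipeline is the fastest way in. When you need finer control, such as custom fine-tuning or adjusting generation parameters, you can drop down to the lower-level interfaces.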
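
To make the three roles concrete, here is a toy sketch in plain Python. Everything in it (the tiny vocabulary, toy_tokenize, toy_model, toy_pipeline) is a hypothetical stand-in rather than the library's implementation: real tokenizers use subword vocabularies and real models are neural networks. It only shows how the three stages chain together.

```python
# Toy illustration of the tokenizer -> model -> post-processing chain.
# Every name here is hypothetical and exists only for this sketch.

VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "i": 3, "love": 4, "hate": 5, "nlp": 6}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def toy_tokenize(text):
    """Lowercase, split on whitespace, map words to IDs, add special tokens."""
    words = text.lower().split()
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def toy_model(input_ids):
    """Return one raw score (logit) per label from a hand-written rule."""
    positive = float(input_ids.count(VOCAB["love"]))
    negative = float(input_ids.count(VOCAB["hate"]))
    return [negative, positive]


def toy_pipeline(text):
    """Chain the three stages and return a pipeline-style result."""
    logits = toy_model(toy_tokenize(text))
    best = max(range(len(logits)), key=lambda i: logits[i])
    return {"label": ID2LABEL[best], "logits": logits}


print(toy_tokenize("I love NLP"))   # [0, 3, 4, 6, 1]
print(toy_pipeline("I love NLP"))   # {'label': 'POSITIVE', 'logits': [0.0, 1.0]}
```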

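3. Text Classification in Practice: Sentiment Analysis and Custom Labels

3.1 Sentiment Analysis with a Pipeline

When no model is specified, the sentiment-analysis pipeline automatically downloads a default model (distilbert-base-uncased-finetuned-sst-2-english):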

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I love using Hugging Face Transformers!")
print(result)
# Example output (the exact score may differ): [{'label': 'POSITIVE', 'score': 0.9998}]

Pass a list of texts to process them as a batch:

texts = ["This library is amazing!", "I'm not sure if I like it."]
results = classifier(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.3f})")

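3.2 Specifying the Model and Tokenizer

You can swap in any text classification model. For example, cardiffnlp/twitter-roberta-base-sentiment-latest is a RoBERTa model fine-tuned for sentiment on tweets, which outputs negative, neutral, or positive labels. Informal or sarcastic text like the example below is a good test of how well it copes: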

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest"
)
print(classifier("Wow, this is just great... not!"))

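3.3 Zero-Shot Classification: No Training Data Required

Hugging Face provides a zero-shot classification pipeline: you supply only the candidate labels, and the model scores how well each label matches the text: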

zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")
text = "The stock market crashed after the interest rate hike."
labels = ["finance", "sports", "technology", "politics"]

result = zero_shot(text, candidate_labels=labels)
print(result['labels'])
print(result['scores'])

Zero-shot classification is a good fit when the label set changes frequently or labeled data is scarce.
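
With an NLI model such as facebook/bart-large-mnli, the usual approach is to turn each candidate label into a hypothesis sentence, ask the model whether the text entails it, and normalize the entailment scores across labels. The sketch below illustrates only that scoring step in plain Python. The template string is assumed to match the pipeline's default, the helper names are hypothetical, and the logits are invented numbers rather than model output.

```python
import math

# Sketch of the scoring step behind NLI-based zero-shot classification
# (single-label case). The entailment logits are made-up numbers; a real
# run would obtain one entailment logit per hypothesis from the NLI model.


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["finance", "sports", "technology", "politics"]
print(build_hypotheses(labels)[0])             # This example is finance.
hypothetical_logits = [3.2, -2.1, -0.5, 0.4]   # invented for this sketch
for label, score in rank_labels(labels, hypothetical_logits):
    print(f"{label}: {score:.3f}")
```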

3.4 Manual Tokenization and Model Calls (Going Deeper)

To control each step yourself, use the Tokenizer and Model separately:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Tokenize the text into PyTorch tensors
inputs = tokenizer("I am so happy today!", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (batch_size, num_labels)
    predicted_class_id = logits.argmax(dim=-1).item()
    print(model.config.id2label[predicted_class_id])  # POSITIVE
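
The manual route yields raw logits, while the pipeline reports a score between 0 and 1. For a single-label classifier that score is typically a softmax over the logits. The plain-Python sketch below shows the conversion; the logits are invented for illustration and do not come from a real model run.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [NEGATIVE, POSITIVE]; not taken from a real run.
example_logits = [-4.0, 4.5]
probs = softmax(example_logits)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
best = max(range(len(probs)), key=lambda i: probs[i])
print(id2label[best], round(probs[best], 4))
```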

4. Text Generation in Practice: From Free-Form Writing to Conditional Generation

4.1 Quick Generation with a Pipeline

This example uses GPT-2:

generator = pipeline("text-generation", model="gpt2")
prompt = "In a distant future, artificial intelligence"
output = generator(prompt, max_length=100, num_return_sequences=1)
print(output[0]['generated_text'])

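Key parameters:

max_length: upper limit on the total length of the output in tokens, prompt included.
num_return_sequences: how many different outputs to generate per call.
temperature: controls randomness; lower values (e.g. 0.2) make the output more deterministic, higher values (e.g. 0.9) increase diversity.
top_k / top_p: restrict the candidate tokens at each step (the k most likely tokens, or the smallest set whose cumulative probability reaches p), which cuts off unlikely, incoherent continuations.
do_sample=True: turns on sampling. When sampling is off, decoding is greedy (deterministic) and temperature, top_k, and top_p are ignored.

An advanced example: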

output = generator(
    prompt,
    max_length=80,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,
    do_sample=True
)
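
To see what temperature, top_k, and top_p do at a single decoding step, here is a plain-Python sketch. The helper names and the five-token logits are hypothetical, and the library's own implementation works on tensors and differs in detail; this only illustrates the idea of rescaling, truncating, and then sampling.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    scaled = [x / temperature for x in logits]   # low temperature sharpens
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((i, e / total) for i, e in enumerate(exps)),
                    key=lambda pair: pair[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]                  # keep the k most likely tokens
        norm = sum(p for _, p in ranked)
        ranked = [(i, p / norm) for i, p in ranked]
    kept, cumulative = [], 0.0
    for token_id, p in ranked:                   # smallest set reaching top_p
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token ID from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # invented scores for 5 tokens
rng = random.Random(0)
print(filter_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95))
print(sample_token(toy_logits, rng, temperature=0.8, top_k=3, top_p=0.95))
```

Lowering temperature concentrates probability on the top token, while top_k and top_p remove the long tail before sampling.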

4.2 Switching Generation Models

You can use any model with a causal language modeling head, such as GPT-Neo, OPT, or BLOOM:

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

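Small models (such as distilgpt2) are good for quick experiments; larger models (such as gpt2-xl or gpt-neo-2.7B) generally produce better text but need more GPU memory.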
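
As a rough guide to how much memory a model needs, multiply the parameter count by the bytes per parameter. The sketch below does that arithmetic for the weights only; the parameter counts are the nominal figures implied by the model names, and activations and framework overhead add more on top, so treat the results as lower bounds.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Nominal parameter counts implied by the model names (an approximation).
for name, params in [("gpt-neo-1.3B", 1.3e9), ("gpt-neo-2.7B", 2.7e9)]:
    fp32 = weight_memory_gb(params, 4)   # float32: 4 bytes per parameter
    fp16 = weight_memory_gb(params, 2)   # float16: 2 bytes per parameter
    print(f"{name}: ~{fp32:.1f} GB in float32, ~{fp16:.1f} GB in float16")
```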

4.3 Conditional Generation: Summarization and Translation

Transformer models are equally good at conditional generation tasks.

Summarization:

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
... (long text) ...
"""
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

Translation (English to German):

translator = pipeline("translation_en_to_de", model="t5-base")
print(translator("Hello, how are you?", max_length=40)[0]['translation_text'])

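5. Question Answering in Practice: Extractive QA

Extractive question answering (Extractive QA) finds the start and end positions of the answer within a given context, so the answer is always a contiguous span of that text. The best-known models are BERT variants fine-tuned on SQuAD.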
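
An extractive QA model outputs two scores per token: one for "the answer starts here" and one for "the answer ends here". A common decoding rule picks the span with the highest combined score, subject to start <= end and a maximum length. The sketch below illustrates that rule in plain Python with invented logits; best_span is a hypothetical helper, and the pipeline's own decoding differs in detail.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logit + end_logit."""
    best = None
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):          # enforce start <= end
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


tokens = ["Hugging", "Face", "was", "founded", "in", "2016"]
# Invented logits: the model "believes" the answer starts and ends at "2016".
start_logits = [0.1, 0.0, -1.0, -0.5, 0.2, 4.0]
end_logits = [-1.0, 0.3, -1.2, -0.8, -0.4, 3.5]
start, end, score = best_span(start_logits, end_logits)
print(tokens[start:end + 1], score)   # ['2016'] 7.5
```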

5.1 Quick Implementation with a Pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = """
Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
The company is headquartered in New York City and has a strong focus on NLP and open-source software.
"""
question = "Who founded Hugging Face?"

result = qa(question=question, context=context)
print(f"Answer: {result['answer']}, confidence: {result['score']:.4f}")
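# Example output (the exact span and score may differ):
# Answer: Clément Delangue, Julien Chaumond, and Thomas Wolf, confidence: 0.9998

If the context does not contain the answer, the model still returns some span, usually with a low confidence score. In real applications, set a score threshold to filter out unreliable answers.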
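
A threshold filter can be as small as the sketch below. It assumes results shaped like the dictionary printed above (with 'answer' and 'score' keys). The helper name and the 0.3 cutoff are hypothetical; a sensible threshold has to be tuned on your own data.

```python
def filter_answer(result, threshold=0.3):
    """Return the answer text if its score clears the threshold, else None."""
    if result["score"] >= threshold:
        return result["answer"]
    return None


# Hand-written dictionaries in the shape shown above; scores are invented.
confident = {"answer": "New York City", "score": 0.91}
unsure = {"answer": "open-source software", "score": 0.04}
print(filter_answer(confident))  # New York City
print(filter_answer(unsure))     # None
```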

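5.2 Processing Multiple Questions

The simplest approach, shown here, is to loop over the questions and call the pipeline once per question. This is inefficient at scale; for larger workloads, note that the pipeline also accepts lists for question and context in a single call.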

questions = ["When was Hugging Face founded?", "Where is its headquarters?"]
context = "..."   # same as above

for q in questions:
    res = qa(question=q, context=context)
    print(f"Q: {q}\nA: {res['answer']} (score: {res['score']:.4f})")

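5.3 Handling Long Contexts

Models usually cap the input length (for example at 384 or 512 tokens), so long texts must be chunked. A simple strategy is to split the text with a sliding window, keep the question unchanged, run QA on every chunk, and keep the highest-scoring answer. The Hugging Face pipeline already does this internally, controlled by the max_seq_len and doc_stride parameters: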

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad",
              max_seq_len=384,
              doc_stride=128)
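# Long contexts are split into overlapping chunks automatically;
# very_long_context stands for your own long text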
result = qa(question=question, context=very_long_context)
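
The sliding-window strategy described above can be sketched in plain Python as follows. The helper names are hypothetical, and the sketch windows a bare token list: the real pipeline also prepends the question to every chunk, so the room left for context is smaller than max_seq_len. The overlap argument plays the role of doc_stride.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split tokens into chunks of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):   # last chunk reached the end
            break
    return chunks


def pick_best(answers):
    """Choose the highest-scoring answer among per-chunk results."""
    return max(answers, key=lambda a: a["score"])


tokens = list(range(10))
print(sliding_windows(tokens, max_len=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap ensures that an answer straddling a chunk boundary appears whole in at least one chunk, provided the answer is no longer than the overlap.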

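6. Going Further: Saving and Fine-Tuning Models

Once you are comfortable with the pipelines above, you will probably want to fine-tune a model on data from your own domain. This section briefly outlines the fine-tuning workflow:

6.1 Loading a Pretrained Model and Tokenizer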

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
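model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

The num_labels argument must match the number of labels in your classification task; it is 2 here because the IMDB dataset used in the next step has two classes (negative and positive).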

6.2 Preparing the Dataset

Use the datasets library to load or create a dataset, then convert the text into model inputs:

from datasets import load_dataset

dataset = load_dataset("imdb")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
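
The padding="max_length" and truncation=True options make every example the same length so that examples can be stacked into batches. The plain-Python sketch below illustrates the idea with a hypothetical helper and arbitrary token IDs. Real tokenizers handle more detail, for instance keeping the closing special token when they truncate, which this toy version does not.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return fixed-length IDs plus an attention mask (1 = real, 0 = padding)."""
    ids = input_ids[:max_length]                 # truncation
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,   # padding
        "attention_mask": mask + [0] * padding,
    }


print(pad_or_truncate([101, 2023, 102], max_length=5))
# {'input_ids': [101, 2023, 102, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5))
# {'input_ids': [101, 1, 2, 3, 4], 'attention_mask': [1, 1, 1, 1, 1]}
```

The attention mask tells the model which positions hold real tokens, so padding does not influence the prediction.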

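6.3 Defining Training Arguments and Running Training

Hugging Face provides the Trainer class, which wraps the PyTorch training loop (optimizer, batching, evaluation, and checkpointing):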

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older Transformers releases
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)
trainer.train()

6.4 Saving and Reloading

model.save_pretrained("./my_finetuned_model")
tokenizer.save_pretrained("./my_finetuned_model")

# Reload
new_model = AutoModelForSequenceClassification.from_pretrained("./my_finetuned_model")
new_tokenizer = AutoTokenizer.from_pretrained("./my_finetuned_model")

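7. Common Problems and Performance Tips

Out of GPU memory: pass device=-1 to pipeline to force CPU execution, or load a smaller model (such as the distilbert family). You can also load the model in half precision with torch_dtype=torch.float16.
Repetitive generated text: raise temperature, enable top_k and top_p sampling, or set repetition_penalty > 1.0.
Low QA scores: try a larger model (such as bert-large-uncased-whole-word-masking-finetuned-squad), or check whether the context is being truncated.
Slow inference: for production services, consider exporting to ONNX or using an accelerated serving stack such as Hugging Face's Text Generation Inference (TGI).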
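
As an illustration of how a repetition penalty discourages repeated text, the plain-Python sketch below rescales the scores of tokens that have already been generated. The helper name and the logits are hypothetical; the rule shown (divide positive scores, multiply negative ones) is the commonly used formulation, and the library's tensor-based implementation may differ in detail.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the scores of tokens that already appear in the output."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Divide positive scores and multiply negative ones so that both
        # move toward "less likely" when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [2.0, -1.0, 0.5, 3.0]   # invented scores for 4 tokens
print(apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 0], penalty=2.0))
# [1.0, -2.0, 0.5, 3.0]
```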

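8. Summary

After this tutorial, you can:

Use pipeline to call thousands of pretrained models for text classification, generation, and question answering.
Apply zero-shot classification, sentiment analysis, free-form text generation, summarization, translation, and extractive QA.
Swap in other models, fine-tune them, and save them as the basis for your own NLP solutions.

The Hugging Face Transformers ecosystem keeps growing; consult the official documentation and the model hub to go further, and apply what you have learned to your own projects.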