Code Llama in Practice: Meta's Code-Specialized Large Language Model

FreeGuideOnline Latest 2026-06-22

```bash
pip install transformers accelerate torch

# For quantized loading, install bitsandbytes
pip install bitsandbytes
```

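To call a Text Generation Inference (TGI) server from Python, you can additionally install its client library. The TGI server itself is normally run from its Docker image rather than installed with pip.

```bash
pip install text-generation
```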


Approximate minimum hardware requirements:

| Model size | Half precision (FP16) VRAM | 8-bit quantized VRAM | 4-bit quantized VRAM |
|------------|----------------------------|----------------------|----------------------|
| 7B      | ~14 GB               | ~8 GB             | ~6 GB             |
| 13B     | ~26 GB               | ~14 GB            | ~10 GB            |
| 34B     | ~68 GB               | ~35 GB            | ~20 GB            |
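
The FP16 column follows directly from the parameter count: each parameter takes 2 bytes, so 7 billion parameters need about 14 GB. The sketch below computes the size of the weights alone for each precision. The function name is illustrative, and the table's quantized figures are higher than these raw numbers, presumably because they also allow for quantization metadata, activations, and the KV cache.

```python
def estimate_weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


for size in (7, 13, 34):
    fp16 = estimate_weight_memory_gb(size, 16)
    int8 = estimate_weight_memory_gb(size, 8)
    int4 = estimate_weight_memory_gb(size, 4)
    print(f"{size}B: FP16 {fp16:.1f} GB, 8-bit {int8:.1f} GB, 4-bit {int4:.1f} GB")
```

Treat these values as a lower bound when choosing hardware; long prompts and large batch sizes add to them.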

### Quick Start: Generating Code with a Pipeline

Hugging Face `transformers` provides a minimal `text-generation` pipeline, so you do not have to write the loading and inference logic by hand.

```python
from transformers import pipeline
import torch

# Use Code Llama 7B Instruct as the example
model_id = "codellama/CodeLlama-7b-Instruct-hf"

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

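# Build the instruction format (Instruct models expect the [INST] markers).
# The tokenizer normally adds the <s> (BOS) token itself, so it is omitted here.
prompt = "[INST] Write a Python function to check if a number is prime. [/INST]"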

result = generator(
    prompt,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,
)
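print(result[0]["generated_text"])
```

Sample output (prompt echo and any surrounding explanation omitted):

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

Note: Instruct models expect each user turn to be wrapped in `[INST] ... [/INST]`; base completion models do not. With a base model, simply supply the code prefix.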
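
The wrapping rule above can be captured in a small helper. This is a minimal sketch with illustrative function names: it builds a single-turn prompt and removes the prompt echo that the pipeline includes in `generated_text` by default. It leaves the `<s>` token to the tokenizer, on the assumption that the tokenizer adds it.

```python
def build_instruct_prompt(instruction: str) -> str:
    """Wrap a single instruction in the [INST] ... [/INST] markers."""
    return f"[INST] {instruction.strip()} [/INST]"


def strip_prompt_echo(generated_text: str, prompt: str) -> str:
    """Return only the model's continuation if the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text.strip()


prompt = build_instruct_prompt("Write a Python function to check if a number is prime.")
fake_output = prompt + " def is_prime(n): ..."  # stand-in for a real generation
print(strip_prompt_echo(fake_output, prompt))
```

Passing `return_full_text=False` to the pipeline call achieves the same effect without post-processing.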

### Local Loading and Quantized Inference

To run larger models on consumer hardware, or simply to save VRAM, you can use bitsandbytes for 4-bit quantization.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-7b-Instruct-hf"

# 4-bit quantization settings are passed through a BitsAndBytesConfig
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute precision
    bnb_4bit_use_double_quant=True,        # double quantization saves more VRAM
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config,
)

# The tokenizer adds the <s> (BOS) token, so the prompt starts at [INST]
prompt = "[INST] Explain the difference between list and tuple in Python. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.2,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

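For plain code completion (no instruction), pass the partial code directly. This works best with a base checkpoint such as `codellama/CodeLlama-7b-hf` rather than the Instruct variant:

```python
# Assumes `model` and `tokenizer` are already loaded, ideally from a base checkpoint
prompt = "def fibonacci(n):\n    # return the n-th fibonacci number\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# stop_strings needs the tokenizer so generate() can match against decoded text
outputs = model.generate(
    **inputs, max_new_tokens=128, stop_strings=["\ndef "], tokenizer=tokenizer
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```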
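
If the installed `transformers` version does not support `stop_strings`, the same cut-off can be applied to the decoded text afterwards. The helper below is a minimal sketch with an illustrative name. It only searches past the prompt, so a stop sequence that appears inside the prompt itself does not trigger a cut.

```python
def truncate_at_stop(text: str, stop_sequences: list[str], prefix_len: int = 0) -> str:
    """Cut `text` at the earliest stop sequence found at or after `prefix_len`."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop, prefix_len)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


prompt = "def fibonacci(n):\n    # return the n-th fibonacci number\n"
# Stand-in for a decoded generation that runs on into a second function
completion = (
    prompt
    + "    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)\n"
    + "\ndef main():\n    pass\n"
)
print(truncate_at_stop(completion, ["\ndef "], prefix_len=len(prompt)))
```

Unlike `stop_strings`, this does not stop generation early, so the tokens after the stop sequence are still computed and then discarded.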

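### Conversational Programming with the Instruct Model

Code Llama Instruct is tuned for multi-turn dialogue and instruction following. Formatting the conversation history correctly keeps the context coherent across turns.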

```python
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Rust function to reverse a string."},
]

# Convert the conversation into the Llama 2 style prompt format
def format_chat(messages):
    chat = ""
    system = ""
    for msg in messages:
        if msg["role"] == "system":
            # The system prompt is folded into the next user turn
            system = f"<<SYS>>\n{msg['content']}\n<</SYS>>\n\n"
        elif msg["role"] == "user":
            chat += f"<s>[INST] {system}{msg['content']} [/INST]"
            system = ""
        else:
            # An assistant reply closes the current turn
            chat += f" {msg['content']} </s>"
    return chat

prompt = format_chat(messages)
# The prompt already contains <s>, so skip the tokenizer's automatic BOS token
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For follow-up questions, append the model's reply and the new user message to the `messages` list, then format it again.
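
A minimal sketch of that bookkeeping follows. It assumes the decoded output still contains the `[/INST]` marker, so the newest reply is whatever follows the last marker. The helper name and the stand-in `decoded` string are illustrative, not real model output.

```python
def extract_reply(decoded: str) -> str:
    """Return the text after the last [/INST] marker, i.e. the newest assistant turn."""
    return decoded.rsplit("[/INST]", 1)[-1].strip()


messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Rust function to reverse a string."},
]

# Stand-in for tokenizer.decode(outputs[0], skip_special_tokens=True)
decoded = (
    "[INST] Write a Rust function to reverse a string. [/INST] "
    "fn reverse(s: &str) -> String { s.chars().rev().collect() }"
)

# Record the model's answer, then add the follow-up question
messages.append({"role": "assistant", "content": extract_reply(decoded)})
messages.append({"role": "user", "content": "Now make it handle grapheme clusters."})

print([m["role"] for m in messages])
```

The updated list can then be formatted into a prompt and passed to `generate` for the next turn.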

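### Deploying as a Local API Service

To embed Code Llama in development tools or expose it over HTTP, a few lightweight tools are available:

#### Option 1: Ollama (recommended for beginners)

Ollama can pull and run Code Llama with a single command each.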

```bash
# Run after installing Ollama
ollama pull codellama:7b-instruct
ollama run codellama:7b-instruct

# Once Ollama is running, the REST API is available at http://localhost:11434/api/generate
```

Example request:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-instruct",
  "prompt": "Write a Python script to download a file from URL.",
  "stream": false
}'
```
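
The same request can be sent from Python using only the standard library. This is a sketch that assumes Ollama is listening on its default port and that the non-streaming reply carries the generated text in a `response` field. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(
    prompt: str, model: str = "codellama:7b-instruct"
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a Python script to download a file from URL.")` returns the model's answer as a string.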

#### Option 2: Text Generation Inference (TGI)

For production, Hugging Face's TGI is recommended; it supports continuous batching, quantized loading, and streaming output.

```bash
docker run --gpus all -p 8080:80 \
  -v "$PWD/models":/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id codellama/CodeLlama-7b-Instruct-hf \
  --quantize bitsandbytes-nf4
```

The server can then be called through its OpenAI-compatible interface, for example from coding plugins such as Continue.dev or Tabby.
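
As a rough sketch of such a call, the snippet below builds an OpenAI-style chat request with the standard library. The URL combines host port 8080 from the `docker run` command with the `/v1/chat/completions` route, which is an assumption about the deployed TGI version. The `"tgi"` model name is a placeholder, and the function names are illustrative.

```python
import json
import urllib.request

# Assumed endpoint: adjust the host, port, and route to match your deployment
TGI_CHAT_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(user_message: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a non-streaming OpenAI-style chat completion request."""
    payload = {
        "model": "tgi",  # placeholder name for a single-model server
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        TGI_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=timeout) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

Because the payload follows the OpenAI chat format, existing OpenAI client libraries can usually be pointed at the same base URL instead.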

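### Prompt Engineering Best Practices

Output quality depends heavily on how the prompt is constructed.

#### Code completion

Provide ample context, including import statements, class or function signatures, and comments; the more specific, the better.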

```python
import requests
from bs4 import BeautifulSoup

# Scrape all headlines from a news website
def fetch_headlines(url):
```
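
A context-rich prefix like this one can also be assembled programmatically, for instance when an editor plugin gathers the pieces from the open file. The helper below is a minimal sketch; its name and parameters are illustrative.

```python
def build_completion_prompt(imports: list[str], description: str, signature: str) -> str:
    """Join imports, a descriptive comment, and a function signature into a code prefix."""
    lines = list(imports)
    lines.append("")  # blank line after the imports
    lines.append(f"# {description}")
    lines.append(signature.rstrip())
    return "\n".join(lines) + "\n"


prompt = build_completion_prompt(
    imports=["import requests", "from bs4 import BeautifulSoup"],
    description="Scrape all headlines from a news website",
    signature="def fetch_headlines(url):",
)
print(prompt)
```

The resulting string can be passed to a base completion model as-is, without any `[INST]` wrapping.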