VLA (Vision-Language-Action) Models: From Multimodal Understanding to Robot Control

FreeGuideOnline, last updated 2026-06-20

```bash
# Create a Conda environment (Python 3.10 recommended)
conda create -n vla python=3.10 -y && conda activate vla

# Install PyTorch (pick the index URL that matches your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Clone the OpenVLA repository and install it in editable mode
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
```
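
After installation, a quick sanity check can confirm that the packages the later steps rely on are visible to the active environment. The sketch below uses only the standard library and does not import the heavy packages themselves; the helper name `check_environment` and the default package list are illustrative assumptions, not part of OpenVLA.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "transformers", "PIL")):
    """Return {package_name: True/False} depending on whether each
    top-level package can be located, without actually importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, found in check_environment().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

If any entry is reported as missing, re-run the corresponding install step inside the activated `vla` environment.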


### Load the pretrained model

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

# Load the OpenVLA-7b model; in bfloat16 it needs roughly 15 GB of GPU memory
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda")
```
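
The figure of roughly 15 GB can be sanity-checked with back-of-the-envelope arithmetic: the memory taken by the weights is approximately the parameter count times the bytes per parameter. The sketch below assumes about 7.5 billion parameters (language backbone plus vision encoders), which is an estimate rather than a number taken from this guide, and it ignores activations and framework overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed parameter count; an estimate used only for illustration.
ASSUMED_PARAMS = 7.5e9

for dtype_name, nbytes in (("float32", 4), ("bfloat16", 2), ("int8", 1)):
    print(f"{dtype_name:9s} ~{weight_memory_gb(ASSUMED_PARAMS, nbytes):.1f} GB")
```

Under this assumption, bfloat16 weights alone account for about 15 GB, so a 16 GB GPU leaves very little headroom.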

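### Run a single inference

Assume the robot environment has already provided an image of the current observation and a natural-language instruction: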

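```python
from PIL import Image

image = Image.open("observation.png").convert("RGB")
instruction = "pick up the red cube and place it on the blue plate"
# Prompt template used in the OpenVLA README
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Build the model inputs
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# Predict the action
with torch.no_grad():
    action = model.predict_action(
        **inputs,
        unnorm_key="bridge_orig",  # selects the un-normalization statistics for the robot platform / dataset
        do_sample=False,           # greedy decoding, as in the OpenVLA README
    )
# `action` is a continuous 7-DoF end-effector action (position delta, rotation delta, gripper);
# it still has to be passed to your robot's own controller.
print("Predicted action:", action)
```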
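
To make the link between discrete action bins and real-valued commands concrete, here is a simplified sketch of the kind of mapping involved: a bin index is turned into a normalized value in [-1, 1], which is then rescaled with per-dimension low/high statistics (the role played by `unnorm_key`). The bin count of 256, the function names, and the example statistics are assumptions for illustration, not values read from an OpenVLA checkpoint.

```python
N_BINS = 256  # assumed number of bin edges per action dimension


def bin_centers(n_bins=N_BINS):
    """Centers of the n_bins - 1 uniform intervals covering [-1, 1]."""
    edges = [-1.0 + 2.0 * i / (n_bins - 1) for i in range(n_bins)]
    return [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]


def detokenize(bin_indices, low, high):
    """Map one bin index per action dimension to a real-valued action.

    low / high are per-dimension statistics (hypothetical here) that
    define the range the normalized value is rescaled into.
    """
    centers = bin_centers()
    actions = []
    for idx, lo, hi in zip(bin_indices, low, high):
        idx = min(max(idx, 0), len(centers) - 1)  # clip to a valid bin
        normalized = centers[idx]                 # value in [-1, 1]
        actions.append(0.5 * (normalized + 1.0) * (hi - lo) + lo)
    return actions


# Made-up statistics for a 3-dimensional toy action
print(detokenize([127, 0, 254], low=[-0.1, 0.0, 0.0], high=[0.1, 1.0, 1.0]))
```

The middle bin maps to the midpoint of the range, while the first and last bins map to values just inside the low and high bounds.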