VLA (Vision-Language-Action) Models: From Multimodality to Robot Control
FreeGuideOnline
2026-06-20
```bash
# Create a Conda environment (Python 3.10 recommended)
conda create -n vla python=3.10 -y && conda activate vla

# Install PyTorch (pick the index URL that matches your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Clone the OpenVLA repository and install it in editable mode
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
```
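After installation, it can help to confirm that the packages used in the rest of this guide are importable from the active environment. The following is a minimal sketch using only the standard library; the helper name `check_environment` and the package list are illustrative assumptions, not part of OpenVLA.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "transformers", "PIL")):
    """Report the Python version and whether each package can be found."""
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Any package reported as `False` should be installed before continuing.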
### Load the pretrained model
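Before loading a 7B-parameter model such as OpenVLA-7b, a back-of-the-envelope estimate of the weight memory helps when choosing a dtype. The sketch below assumes a round 7 billion parameters (read off the "7b" in the model name, not an exact count) and counts weights only, so activations and CUDA overhead come on top. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


APPROX_PARAMS = 7e9  # assumption: rounded from the "7b" in the model name

if __name__ == "__main__":
    for dtype_name, nbytes in {"float32": 4, "bfloat16": 2, "int8": 1}.items():
        print(f"{dtype_name}: ~{weight_memory_gb(APPROX_PARAMS, nbytes):.0f} GB")
```

Under these assumptions, bfloat16 weights land in the same range as the roughly 15 GB quoted for the model, while float32 would need about twice that.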
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
# Load the OpenVLA-7b model; needs roughly 15 GB of GPU memory in bfloat16
model_id = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to("cuda")
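```

### Run a single inference

Assume the robot environment has already provided a current observation image and a natural-language instruction:

```python
from PIL import Image

image = Image.open("observation.png").convert("RGB")
instruction = "pick up the red cube and place it on the blue plate"
# Prompt template used in the OpenVLA README
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

# Build the model inputs
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# Predict the action
with torch.no_grad():
    action = model.predict_action(
        **inputs,
        unnorm_key="bridge_orig",  # selects the un-normalization statistics for the robot platform / dataset
        do_sample=False,
    )

# predict_action decodes the generated action tokens and un-normalizes them,
# so `action` is a continuous action vector (for bridge_orig, a 7-D end-effector
# delta plus gripper command). It still has to be mapped onto your robot's controller.
print("Predicted action:", action)
```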
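For intuition about how discrete action tokens relate to continuous robot commands, here is a simplified sketch of a uniform binning scheme of the kind VLA models use: each action dimension is normalized to [-1, 1], discretized into bins, and later mapped back through per-dimension bounds. The bin count, the bounds `low` and `high`, and the helper names are illustrative assumptions, not OpenVLA's exact code or statistics.

```python
from typing import List, Sequence

N_BINS = 256  # assumption: number of uniform bin edges over [-1, 1]


def bin_centers(n_bins: int = N_BINS) -> List[float]:
    """Centers of the n_bins - 1 intervals between uniform edges on [-1, 1]."""
    step = 2.0 / (n_bins - 1)
    edges = [-1.0 + i * step for i in range(n_bins)]
    return [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]


def decode_action(
    bin_ids: Sequence[int], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Map per-dimension bin indices back to un-normalized action values."""
    centers = bin_centers()
    action = []
    for idx, lo, hi in zip(bin_ids, low, high):
        idx = min(max(idx, 0), len(centers) - 1)  # clamp to a valid bin
        normalized = centers[idx]                 # value in [-1, 1]
        # Linear rescale from [-1, 1] to [lo, hi]
        action.append(0.5 * (normalized + 1.0) * (hi - lo) + lo)
    return action


if __name__ == "__main__":
    # Hypothetical bounds for a 3-D position delta, in meters
    low = [-0.05, -0.05, -0.05]
    high = [0.05, 0.05, 0.05]
    print(decode_action([0, 127, 254], low, high))
```

The `unnorm_key` argument plays the role of choosing `low` and `high` here: different robot datasets have different action ranges, so the same token decodes to different physical values depending on which statistics are selected.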