Video Search: Cross-Modal Video Moment Retrieval and Localization

FreeGuideOnline Latest 2026-06-24

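```python
# Encode the text query into a single CLIP embedding.
text_features = clip.encode_text("a person opening a fridge")
# CLIP features for the video, computed ahead of time.
video_features = precomputed_clip_features  # shape: (T, D)
# Similarity between the query and each of the T time steps.
scores = cosine_similarity(text_features, video_features)
# Choose the start and end of the span that best matches the query.
pred_start, pred_end = find_best_segment(scores)
```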
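The snippet above leaves `cosine_similarity` and `find_best_segment` undefined. The following is a minimal NumPy sketch of one way they could work, not the original implementation. It assumes the text feature is a single vector of size D and the video features form a (T, D) array. The segment rule is also an assumption: it subtracts a threshold (the mean score by default) from every score and returns the contiguous span with the largest summed margin, using Kadane's maximum-subarray scan. The synthetic `query` and `frames` arrays stand in for real CLIP features.

```python
import numpy as np


def cosine_similarity(text_features, video_features):
    """Cosine similarity between one query vector (D,) and T frame vectors (T, D)."""
    text = np.asarray(text_features, dtype=np.float64).reshape(-1)
    video = np.asarray(video_features, dtype=np.float64)
    eps = 1e-12  # guards against division by zero for all-zero vectors
    text = text / (np.linalg.norm(text) + eps)
    video = video / (np.linalg.norm(video, axis=1, keepdims=True) + eps)
    return video @ text  # shape: (T,)


def find_best_segment(scores, threshold=None):
    """Return inclusive (start, end) of the span maximizing sum(score - threshold).

    Assumption: the threshold defaults to the mean score, so time steps that
    score above average extend a segment and those below average shorten it.
    """
    scores = np.asarray(scores, dtype=np.float64)
    if scores.size == 0:
        raise ValueError("scores must contain at least one time step")
    if threshold is None:
        threshold = scores.mean()
    gains = scores - threshold

    # Kadane's scan: track the best-scoring run ending at each time step.
    best_sum, best_start, best_end = gains[0], 0, 0
    run_sum, run_start = gains[0], 0
    for t in range(1, len(gains)):
        if run_sum < 0:
            # A negative running total can only hurt; restart the run here.
            run_sum, run_start = gains[t], t
        else:
            run_sum += gains[t]
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, t
    return best_start, best_end


# Synthetic stand-ins for CLIP features: frames 3..5 point in the query direction.
query = np.array([1.0, 0.0, 0.0, 0.0])
frames = np.tile(np.array([0.0, 1.0, 0.0, 0.0]), (8, 1))
frames[3:6] = query

scores = cosine_similarity(query, frames)
pred_start, pred_end = find_best_segment(scores)
print(pred_start, pred_end)  # 3 5
```

The returned indices are positions along the T axis. Converting them to timestamps requires the sampling rate used when the features were extracted, which the source does not state.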