Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos

파이톨치 2025. 8. 20. 16:30

Here is a summary of the paper “Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos”.

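Background

Temporal Sentence Grounding (TSG): the task of finding the time span (start–end) in a video that corresponds to a given natural-language query.
Prior work focuses on short videos → long videos pose the following difficulties:
1. Complex context: temporal reasoning is required over long sequences.
2. Multimodal information: long videos carry both visual and speech information, and both must be used effectively.
Existing TSG models: many parameters and heavy computation make them inefficient on long videos, and they also generalize poorly.
Existing MLLM-V (multimodal-LLM-based approaches): not well aligned with the TSG task, so their temporal reasoning is weak.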

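Proposed Method: Grounding-Prompter

Core idea → adapt the LLM to TSG through prompting, and use multimodal information efficiently.

1. Compressed Task Textualization

Convert the long video into a textual representation that the LLM can understand.
- Speech → transcribed with ASR, keeping the time information.
- Visual → frames sampled on scene changes, then captioned.
Redundant frames are minimized and only query-relevant information is textualized → the long context is compressed.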
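
The textualization step above can be pictured with a small sketch. This is not the paper's code: the `Segment` type, the `[MM:SS-MM:SS]` line format, and the rule of merging consecutive identical captions are all assumptions made for illustration, and the query-relevance filtering is not covered here. ASR and captioning are assumed to have already produced timestamped text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str
    source: str   # "speech" (ASR transcript) or "visual" (frame caption)


def fmt(seconds: float) -> str:
    """Format seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def textualize(speech, captions):
    """Merge timestamped speech and caption segments into one text block.

    Consecutive captions with identical text are merged into one span,
    a simple stand-in (assumption) for reducing redundant frames.
    """
    deduped = []
    for cap in sorted(captions, key=lambda s: s.start):
        if deduped and deduped[-1].text == cap.text:
            prev = deduped[-1]
            deduped[-1] = Segment(prev.start, max(prev.end, cap.end), prev.text, prev.source)
        else:
            deduped.append(cap)
    # Order everything by start time; on ties, speech comes before visual.
    merged = sorted(list(speech) + deduped, key=lambda s: (s.start, s.source))
    return "\n".join(
        f"[{fmt(s.start)}-{fmt(s.end)}] ({s.source}) {s.text}" for s in merged
    )


speech = [
    Segment(0, 12, "Welcome to the tutorial.", "speech"),
    Segment(65, 80, "Now let's install the tool.", "speech"),
]
captions = [
    Segment(0, 30, "a person at a desk", "visual"),
    Segment(30, 60, "a person at a desk", "visual"),
    Segment(60, 90, "a terminal window", "visual"),
]
text = textualize(speech, captions)
print(text)
```

The two identical captions collapse into a single `[00:00-01:00]` line, so the text handed to the LLM gets shorter while the timestamps stay intact.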

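2. Boundary-Perceptive Prompting

Strategies for strengthening the perception of temporal boundaries:

Multiscale Denoising Chain-of-Thought (CoT)
① Global understanding → ② noise evaluation (caption reliability) → ③ per-partition understanding → ④ final prediction.
Validity Principles
- Enforce JSON output format,
- Guarantee Start < End,
- Forbid simply copying the example's answer.
One-shot In-Context Learning (ICL)
- Provide a single example so that the LLM picks up the reasoning procedure.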
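
As a rough sketch of how these three pieces could fit together, the snippet below assembles a one-shot prompt with the four CoT steps and checks a reply against the validity principles. The prompt wording, the `{"start": ..., "end": ...}` schema, and the function names are assumptions for illustration, not the paper's actual prompt. It also assumes the JSON answer has already been separated from the model's reasoning text.

```python
import json

# Illustrative wording for the four reasoning stages (assumption).
COT_STEPS = [
    "1. Summarize the whole video (global understanding).",
    "2. Judge how reliable the captions are (noise evaluation).",
    "3. Describe each partition of the video in turn.",
    "4. Give the final answer.",
]


def build_prompt(video_text, query, example):
    """Assemble a one-shot prompt from textualized video, query, and one example."""
    return "\n\n".join([
        "Task: find the time span in the video that matches the query.",
        "Reason in these steps:\n" + "\n".join(COT_STEPS),
        'Rules: answer with JSON {"start": <seconds>, "end": <seconds>}; '
        "start must be smaller than end; do not copy the example answer.",
        "Example:\n" + example,
        "Video:\n" + video_text,
        "Query: " + query,
    ])


def parse_answer(raw, example_answer=None):
    """Return (start, end) if the reply passes the validity checks, else None."""
    try:
        data = json.loads(raw)
        start, end = float(data["start"]), float(data["end"])
    except (ValueError, KeyError, TypeError):
        return None  # not valid JSON, or fields missing / not numeric
    if not start < end:
        return None  # violates Start < End
    if example_answer is not None and (start, end) == tuple(example_answer):
        return None  # identical to the one-shot example's answer
    return start, end


prompt = build_prompt("[00:00-00:12] (speech) Welcome.", "introduction", "(one worked example)")
print(parse_answer('{"start": 120, "end": 185.5}'))
```

A reply that fails `parse_answer` could be retried or counted as a failure; which of these the paper does is not stated in this summary.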

Experiments

Dataset: VidChapters-mini (a subset of 13–15 minute videos drawn from VidChapters-7M).
Metrics: Recall@IoU, mIoU, Recall@seconds, collapse rate (cr).
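
For reference, here is a minimal sketch of how the span-based metrics are usually computed. Temporal IoU, Recall@IoU, and mIoU follow their standard definitions. `recall_at_seconds` assumes the metric counts a hit when the predicted start lies within n seconds of the ground-truth start, which is my reading of the summary rather than a confirmed definition. Collapse rate is left out because this summary does not define it.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def mean_iou(preds, gts):
    """Average temporal IoU over all queries."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


def recall_at_seconds(preds, gts, n):
    """Assumed definition: predicted start within n seconds of the true start."""
    hits = sum(abs(p[0] - g[0]) <= n for p, g in zip(preds, gts))
    return hits / len(gts)


preds = [(10, 50), (100, 160)]
gts = [(20, 60), (300, 360)]
print(mean_iou(preds, gts), recall_at_iou(preds, gts, 0.5), recall_at_seconds(preds, gts, 10))
```

In this toy case the first prediction overlaps its target with IoU 0.6 and the second misses entirely, so mIoU is 0.3 and Recall@0.5 is 0.5.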

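🔹 Results

Rule-based baselines, zero-shot CLIP/BERT, and MLLM-V models (VideoChat, Video-ChatGPT, Video-LLaMA) all struggle with TSG on long videos.
Without any training, Grounding-Prompter outperforms the existing SOTA model Moment-DETR on several metrics.
The improvement is especially large on r@{n}s (closeness of the predicted start point).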

🔹 Ablation Study

Performance is best only when CoT and ICL are both used.
Combining speech + visual is the strongest setting (speech contributes more; visual still helps improve IoU despite its noise).

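Conclusion

Grounding-Prompter is presented as the first method to apply an LLM to TSG in long videos.
Multimodal information (speech + visual) is textualized and fed to the LLM → efficient, with strong generalization.
Boundary-Perceptive Prompting strengthens reasoning about temporal boundaries.
Without training, it reaches performance competitive with existing fully-trained models.

✅ One-line summary:
To ground natural-language queries in long videos, the method prompts an LLM with multimodal input converted to text, and strengthens temporal boundary perception through step-by-step reasoning (CoT), format constraints, and example-based learning (ICL).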
