Object-Shot Enhanced Grounding Network for Egocentric Video (OSGNet)

파이톨치 2025. 5. 16. 14:57

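🧠 1. Why this approach? (Motivation)

Egocentric (first-person) video differs from ordinary third-person (exocentric) video in a few ways:

The camera is mounted on the wearer's head, so a shift in gaze is a camera movement
Queries often carry fine-grained information about objects sitting in the background (e.g., a measuring tape)
Existing methods mostly focus on actions, so they are weak at detecting background objects
They also make no use of shot segmentation or attention shifts (changes in where the wearer is looking)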

🔍 Example query:
"Where was the measuring tape before I picked the drill with my right hand?"
→ Existing models localize the drill well but miss background objects such as the measuring tape

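🧩 2. How does it approach the problem? (Overall design)

OSGNet is built on two key insights:

Object-aware grounding: explicitly detect the objects mentioned in the query (e.g., measuring tape) and fold them into the features
Shot-aware contrastive learning: split the video into "shots" based on the wearer's gaze shifts (head turns and the like), and learn shot-query alignment with a contrastive loss

💡 The overall architecture has four stages:

(a) Object Extraction: detect objects with Co-DETR → embed their names as text with CLIP → align with the query via cross-attention

(b) Feature Extraction: extract query and video features with a video backbone + CLIP

(c) Main Branch: multi-modal fusion → multi-scale network → classification + regression heads predict the temporal span

(d) Shot Branch: split the video into shots according to head motion → learn shot-query alignment with a contrastive loss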

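🔧 3. What methods does it use? (Key techniques + examples)

🧱 Object Feature Fusion

An object detector (Co-DETR) finds query-relevant objects in each frame (e.g., "shovel", "tape")
The text name of each detected object goes through the CLIP text encoder to produce an object feature
The object features are combined with the query via cross-attention
🎯 Example: “How many drill bits did I remove...” → objects: drill, drill bit, carton → each plays a different role in the query!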
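
To make the cross-attention step concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the 2-d vectors are made-up stand-ins for CLIP text embeddings, there are no learned projections, and treating the query tokens as attention queries with the object features as keys/values is an assumption about the direction.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

# Toy embeddings (assumed values, not real CLIP outputs).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]               # two query tokens
object_feats = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]   # three detected object names

# Each query token becomes a weighted mix of the object features it matches best.
attended = cross_attention(query_tokens, object_feats, object_feats)
```

The first query token ends up dominated by the first object feature because their dot product is largest, which is the "alignment" the step is after.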

🔀 Multi-modal Fusion (Main Branch)

The video feature is cross-attended with the query and with the object feature separately
A gating mechanism then weighs the query and object contributions and merges them
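
A rough sketch of what such a gate can look like is given here. The exact gate used in OSGNet is not spelled out in this post, so the form here (a sigmoid over the concatenated features, then a convex combination) is an assumption, and `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(query_ctx, object_ctx, gate_weights, gate_bias=0.0):
    """Blend two same-sized feature vectors with a scalar gate in (0, 1).

    query_ctx:  video feature after attending to the query
    object_ctx: video feature after attending to the object features
    The gate is computed from the concatenation of both (assumed form).
    """
    concat = query_ctx + object_ctx
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * q + (1.0 - gate) * o for q, o in zip(query_ctx, object_ctx)]
    return fused, gate

# With all-zero gate weights the gate sits at 0.5, so both sources count equally.
fused, gate = gated_fusion([1.0, 2.0], [3.0, 4.0], gate_weights=[0.0] * 4)
```

In a trained model the gate weights would be learned, letting the network lean on object cues for object-heavy queries and on the query context otherwise.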

🧠 Shot Branch

LAVILA generates captions such as "turns around" and "looks around"
These phrases serve as head-motion indicators that drive shot segmentation
Shot features and the query are aligned through contrastive learning (InfoNCE)
🎯 Example: among several shots, the one where “I look at the car” is learned as the shot that matches the query
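
The two ideas in this branch can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the captions are invented, the phrase list only contains the two phrases quoted above, substring matching is a simplification, and the temperature 0.07 is a common default rather than a value taken from the paper.

```python
import math

HEAD_MOTION_PHRASES = ("turns around", "looks around")

def split_into_shots(captions):
    """Start a new shot at every clip whose caption mentions head motion.

    Returns (start, end) clip-index ranges with end exclusive.
    """
    boundaries = [0]
    for i, cap in enumerate(captions):
        if i > 0 and any(p in cap.lower() for p in HEAD_MOTION_PHRASES):
            boundaries.append(i)
    boundaries.append(len(captions))
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if e > s]

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def info_nce(query, shot_feats, positive_idx, temperature=0.07):
    """InfoNCE loss: -log softmax of the positive shot's similarity."""
    logits = [cosine(query, s) / temperature for s in shot_feats]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_idx]

# Invented per-clip captions for illustration.
captions = ["C picks a drill", "C looks around", "C looks at the car",
            "C turns around", "C walks to the door"]
shots = split_into_shots(captions)

# Toy shot features: the loss is small when the positive shot matches the query.
query = [1.0, 0.0]
shot_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(query, shot_feats, positive_idx=0)
loss_misaligned = info_nce(query, shot_feats, positive_idx=1)
```

Minimizing this loss pulls the query toward the shot that contains the answer and pushes it away from the other shots in the same video.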

📊 4. What were the results? (Performance comparison)

✅ Ego4D-NLQ v2 (main benchmark)

Model	R@1, IoU=0.5	R@5, IoU=0.5

GroundVQA	20.23%	37.83%
OSGNet (ours)	22.03%	45.19%

Compared with GroundVQA, R@1 rises by 1.8%p and R@5 by a full 7.4%p!
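
For readers unfamiliar with the metric, here is a small sketch of how R@k at IoU=0.5 is typically computed for temporal grounding. It is a simplified illustration with made-up predictions, not the benchmark's official evaluation script, and the function names are my own.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, ground_truths, k, iou_thresh=0.5):
    """Fraction of queries whose top-k predictions contain a span with IoU >= threshold."""
    if not ground_truths:
        return 0.0
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(ranked_preds, ground_truths)
    )
    return hits / len(ground_truths)

# Two toy queries; predictions are sorted by confidence (made-up numbers).
ranked_preds = [
    [(0.0, 4.0), (9.0, 20.0)],   # the correct span is only ranked second
    [(2.0, 6.0)],                # the correct span is ranked first
]
ground_truths = [(10.0, 20.0), (2.0, 6.0)]

r1 = recall_at_k(ranked_preds, ground_truths, k=1)
r5 = recall_at_k(ranked_preds, ground_truths, k=5)
```

In this toy case R@1 is 0.5 and R@5 is 1.0, which shows why R@5 can improve much more than R@1: a model may rank the right span a little below the top. The gains in the table are differences in percentage points, i.e. 22.03 - 20.23 = 1.80 and 45.19 - 37.83 = 7.36.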

✅ TACoS (exocentric video)

Model	R@1, IoU=0.5

SnAG	44.86%
OSGNet	48.18%

→ SOTA not only on egocentric video but on ordinary (exocentric) video as well, 3.32%p above SnAG

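⚠️ 5. Limitations?

Dependence on object detection: result quality can vary with how well Co-DETR performs
Preprocessing complexity: object detection and shot segmentation both add extra steps
Dual-branch architecture: higher memory and compute cost
Shot segmentation robustness: caption-based segmentation may be sensitive to noise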

📌 Summary

Item	Content

Problem	Grounding fine-grained, object-centric queries in egocentric video is hard
Approach	Object-aware + shot-aware grounding (OSGNet)
Method	Object detection followed by cross-attention, plus head-motion-based shot segmentation with contrastive learning
Results	SOTA on Ego4D, Goal-Step, and TACoS
Limitations	Reliance on object detection accuracy; complex architecture and added compute
