KoSum

Beyond Highlight Detection: Most-Replayed Driven Multimodal Analysis of Korean YouTube Videos for Highlight Editing Guidance

2026 Spring DSC3028 Capstone Project

Chanhee Lee*†, Jinho Jang*, Sungjun Ha, Jinwoong Jung

Applied Artificial Intelligence, Sungkyunkwan University

* Equal contribution · † Team Leader

Abstract

This project proposes a framework for practical video editing guidance. Existing highlight detection and video summarization models remain limited for real-world editing, where creators need actionable guidance rather than simple cut-based automation. We construct KoSum, a Korean YouTube benchmark for video summarization. It consists of recent videos from 2024 to 2026, collected through a structured protocol across 14 fine-grained content categories (e.g., entertainment and cooking) and annotated with triple-modality signals (visual, audio, and text) and Most-Replayed signals. Using a triple-modality model, we analyze highlight regions through modality attention patterns and feature-level statistics, capturing editing-related multimodal cues (e.g., shot transitions, motion dynamics, subtitle density, and audio novelty). Based on these analyses, we propose a data-analysis-conditioned prompting framework that augments user prompts with category-specific, statistically grounded editing cues. Under LLM-based evaluation, the proposed framework outperforms the user-prompt-only baseline by 95 points in total score and 3.17 points in average score, indicating that it generates more practical and structured editing guidelines. Our framework connects viewer engagement signals, multimodal analysis, and practical editing support for real-world creators. The project page is available at https://iontail.github.io/kosum/.

Overall Pipeline

KoSum framework pipeline
KoSum uses YouTube videos and Most-Replayed signals to train a triple-modality model, analyze highlight regions, and generate category-aware editing guidelines with an LLM.
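
One step in this pipeline, turning a Most-Replayed curve into candidate highlight regions, can be illustrated with a minimal sketch. It assumes the Most-Replayed signal is available as one normalized score per fixed-length segment and that a region is any run of consecutive segments at or above a threshold. The function name `highlight_regions`, the 0.7 threshold, and the 10-second segment length are illustrative assumptions, not details taken from the project.

```python
from typing import List, Tuple


def highlight_regions(
    scores: List[float],
    segment_sec: float,
    threshold: float = 0.7,  # assumed cutoff, not a value from the project
) -> List[Tuple[float, float]]:
    """Merge consecutive segments scoring >= threshold into (start, end) spans in seconds."""
    regions: List[Tuple[float, float]] = []
    start = None  # index of the first segment in the currently open region
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            regions.append((start * segment_sec, i * segment_sec))
            start = None
    if start is not None:  # close a region that runs to the end of the video
        regions.append((start * segment_sec, len(scores) * segment_sec))
    return regions


# Toy scores for seven 10-second segments (made-up values for illustration).
toy_scores = [0.1, 0.8, 0.9, 0.3, 0.2, 0.75, 0.4]
print(highlight_regions(toy_scores, segment_sec=10.0))
# [(10.0, 30.0), (50.0, 60.0)]
```

A fixed threshold is only one possible choice; a top-k selection or a per-video percentile would fit the same interface.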

Demo

The demo shows how a user-provided category, video context, and editing direction can be converted into structured editing guidance.
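
The conversion described here can be sketched as a small prompt builder. This is a hypothetical illustration of data-analysis-conditioned prompting, not the project's implementation: the `GuideRequest` fields mirror the three demo inputs, while `build_prompt`, the cue table, and the cue wording are placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GuideRequest:
    """The three user inputs shown in the demo."""
    category: str
    video_context: str
    editing_direction: str


def build_prompt(req: GuideRequest, cues_by_category: Dict[str, List[str]]) -> str:
    """Augment the user's inputs with category-specific cues before calling an LLM."""
    lines = [
        f"Category: {req.category}",
        f"Video context: {req.video_context}",
        f"Editing direction: {req.editing_direction}",
    ]
    cues = cues_by_category.get(req.category, [])
    if cues:  # fall back to a user-prompt-only request when no cues exist
        lines.append("Editing cues from data analysis:")
        lines.extend(f"- {cue}" for cue in cues)
    lines.append("Write a structured editing guideline.")
    return "\n".join(lines)


# Placeholder cues; real cues would come from the highlight-region analysis.
example_cues = {"cooking": ["<cue about shot transitions>", "<cue about subtitle density>"]}
request = GuideRequest("cooking", "A 15-minute home recipe video", "Fast-paced short highlight")
print(build_prompt(request, example_cues))
```

With an empty cue table the same function produces the user-prompt-only form, which corresponds to the baseline condition in the evaluation.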

User Input

Category · Video Context · Editing Direction

Editing Guideline


KoSum Overview

KoSum contains 700 Korean YouTube videos across 14 categories, with visual, textual, and audio signals aligned to Most-Replayed engagement cues.

700 Videos

Collected from Korean YouTube content uploaded between 2024 and 2026.

14 Categories

Fine-grained content types including entertainment, cooking, sports, knowledge, and lecture videos.

3 Modalities

Visual, audio, and text signals aligned with Most-Replayed importance scores.

KoSum Statistics

Videos average about 17 minutes (1019.3 sec) in length and range from 6.4 to 30 minutes (384.0 to 1800.0 sec), with full audio availability and dense transcript coverage (88.18% transcript density).

Statistic                    Value

Duration Statistics
  Avg. Duration              1019.3 sec
  Std. Duration              339.0 sec
  Min Duration               384.0 sec
  Max Duration               1800.0 sec

Textual Statistics
  Total # of Tokens          2.310M
  Avg. # of Tokens/Video     3299.6

  Transcript Density         88.18%

Audio Statistics
  Audio Availability         100%

Video Duration Distribution
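
As a quick consistency check on the reported figures, the short sketch below converts the duration statistics to minutes and recomputes the total token count from the per-video average. All input numbers come from the statistics above; the variable names are illustrative.

```python
# Values copied from the KoSum statistics.
num_videos = 700
avg_duration_sec = 1019.3
min_duration_sec = 384.0
max_duration_sec = 1800.0
avg_tokens_per_video = 3299.6

# Convert durations from seconds to minutes.
avg_minutes = avg_duration_sec / 60
min_minutes = min_duration_sec / 60
max_minutes = max_duration_sec / 60

# Total tokens implied by the per-video average, in millions.
total_tokens_millions = num_videos * avg_tokens_per_video / 1e6

print(f"avg: {avg_minutes:.2f} min, range: {min_minutes:.1f} to {max_minutes:.1f} min")
print(f"implied total tokens: {total_tokens_millions:.3f}M")
```

The implied total of roughly 2.310M tokens agrees with the reported total, and the average duration rounds to 17 minutes.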

Guide Evaluation

The proposed data-analysis-conditioned prompting framework is evaluated against a user-prompt-only baseline using an LLM evaluator. Each sample is scored across editing direction, scene selection clarity, composition specificity, editing detail, content grounding, feasibility, and sample-specific guidance.
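
The aggregation behind such a comparison can be sketched as follows. The seven criterion keys follow the list above, but the score scale, the `summarize` helper, and the toy scores are assumptions made for illustration; they are not the project's evaluation code or results.

```python
from statistics import mean
from typing import Dict, List

# Criterion names taken from the evaluation description; key spelling is illustrative.
CRITERIA = [
    "editing_direction",
    "scene_selection_clarity",
    "composition_specificity",
    "editing_detail",
    "content_grounding",
    "feasibility",
    "sample_specific_guidance",
]


def summarize(samples: List[Dict[str, float]]) -> Dict[str, object]:
    """Return per-criterion means, the summed total, and the mean per-sample total."""
    per_criterion = {c: mean(s[c] for s in samples) for c in CRITERIA}
    sample_totals = [sum(s[c] for c in CRITERIA) for s in samples]
    return {
        "per_criterion": per_criterion,
        "total": sum(sample_totals),
        "avg_total": mean(sample_totals),
    }


# Toy scores (made up): two samples per system, every criterion scored identically.
baseline = [{c: 3 for c in CRITERIA} for _ in range(2)]
ours = [{c: 4 for c in CRITERIA} for _ in range(2)]

b, o = summarize(baseline), summarize(ours)
print("total gap:", o["total"] - b["total"])
print("average gap per sample:", o["avg_total"] - b["avg_total"])
```

Reporting both the summed total and the per-sample average keeps the comparison readable when the number of evaluated samples changes.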

Average Score by Evaluation Criterion (chart; series: Baseline, Ours)

Average Total Score by Subcategory (chart; series: Baseline, Ours)

Citation

BibTeX
@misc{lee2026kosum,
  title  = {Beyond Highlight Detection: Most-Replayed Driven Multimodal Analysis of Korean YouTube Videos for Highlight Editing Guidance},
  author = {Lee, Chanhee and Jang, Jinho and Ha, Sungjun and Jung, Jinwoong},
  year   = {2026},
  note   = {2026 Spring DSC3028 Capstone Project Technical Report, Sungkyunkwan University}
}