KoSum
Beyond Highlight Detection: Most-Replayed Driven Multimodal Analysis of Korean YouTube Videos for Highlight Editing Guidance
2026 Spring DSC3028 Capstone Project
Applied Artificial Intelligence, Sungkyunkwan University
Abstract
This project proposes a framework for practical video editing guidance. Existing highlight detection and video summarization models remain limited for real-world editing, where creators need actionable guidance rather than simple cut-based automation. We construct KoSum, a Korean YouTube benchmark for video summarization. It consists of videos uploaded between 2024 and 2026, collected through a structured protocol across 14 fine-grained content categories (e.g., entertainment and cooking), and annotated with three modalities (visual, audio, and text) together with Most-Replayed signals. Using a triple-modality model, we analyze highlight regions through modality attention patterns and feature-level statistics, capturing editing-related multimodal cues (e.g., shot transitions, motion dynamics, subtitle density, and audio novelty). Based on these analyses, we propose a data-analysis-conditioned prompting framework that augments user prompts with category-specific, statistically grounded editing cues. The proposed framework outperforms the baseline by 95 points in total score and 3.17 points in average score, indicating that it generates more practical and better-structured editing guidelines. Our framework connects viewer engagement signals, multimodal analysis, and practical editing support for real-world creators. The project page is available at https://iontail.github.io/kosum/.
Overall Pipeline
Demo
The demo shows how a user-provided category, video context, and editing direction can be converted into structured editing guidance.
User Input
Category · Video Context · Editing Direction
Editing Guideline
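As a rough illustration of how the three demo inputs might be combined with category-level statistics before being sent to a language model, the sketch below builds an augmented prompt string. It is not the project's implementation: `UserInput`, `build_conditioned_prompt`, the cue names, and the numeric values in `CATEGORY_CUES` are all hypothetical placeholders, not KoSum results.

```python
from dataclasses import dataclass

# Hypothetical cue table: names and values are placeholders for illustration,
# NOT statistics measured on KoSum.
CATEGORY_CUES = {
    "cooking": {"shot_transitions_per_min": 6.0, "subtitle_density": 0.8},
}


@dataclass
class UserInput:
    category: str
    video_context: str
    editing_direction: str


def build_conditioned_prompt(user: UserInput, cues_by_category: dict) -> str:
    """Append category-specific cue statistics to the user's own prompt fields."""
    lines = [
        f"Category: {user.category}",
        f"Video context: {user.video_context}",
        f"Editing direction: {user.editing_direction}",
    ]
    # Fall back to the plain user prompt when no statistics exist for the category.
    cues = cues_by_category.get(user.category, {})
    if cues:
        lines.append("Highlight statistics for this category:")
        for name, value in sorted(cues.items()):
            lines.append(f"- {name}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    request = UserInput("cooking", "15-minute recipe vlog", "fast-paced short")
    print(build_conditioned_prompt(request, CATEGORY_CUES))
```

Keeping the cue table separate from the prompt builder means the same function yields the user-prompt-only baseline when it is given an empty table.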
KoSum Overview
KoSum contains 700 Korean YouTube videos across 14 categories, with visual, textual, and audio signals aligned to Most-Replayed engagement cues.
Collected from Korean YouTube content uploaded between 2024 and 2026.
Fine-grained content types including entertainment, cooking, sports, knowledge, and lecture videos.
Visual, audio, and text signals aligned with Most-Replayed importance scores.
KoSum Statistics
Videos average about 17 minutes in length and range from roughly 6.4 to 30 minutes, with full audio availability and dense transcript coverage (88.18%).
| Statistic Category | Value |
|---|---|
| Duration Statistics | |
| Avg. Duration | 1019.3 sec |
| Std. Duration | 339.0 sec |
| Min Duration | 384.0 sec |
| Max Duration | 1800.0 sec |
| Textual Statistics | |
| Total # of Tokens | 2.310M |
| Avg. # of Tokens/Video | 3299.6 |
| Transcript Density | 88.18% |
| Audio Statistics | |
| Audio Availability | 100% |
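The table values can be cross-checked with simple arithmetic. The short sketch below converts the reported durations from seconds to minutes and multiplies the per-video token average by the 700 videos stated in the overview; the variable names are illustrative only.

```python
# Values copied from the statistics table and the overview (700 videos).
NUM_VIDEOS = 700
AVG_TOKENS_PER_VIDEO = 3299.6
durations_sec = {"avg": 1019.3, "std": 339.0, "min": 384.0, "max": 1800.0}

# Convert each duration statistic from seconds to minutes (one decimal place).
durations_min = {name: round(sec / 60, 1) for name, sec in durations_sec.items()}

# Total tokens implied by the per-video average, expressed in millions.
total_tokens = NUM_VIDEOS * AVG_TOKENS_PER_VIDEO
total_tokens_millions = round(total_tokens / 1e6, 3)

if __name__ == "__main__":
    print(durations_min)
    print(total_tokens_millions)
```

The implied total of about 2.31 million tokens agrees with the reported 2.310M.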
Guide Evaluation
The proposed data-analysis-conditioned prompting framework is evaluated against a user-prompt-only baseline using an LLM evaluator. Each sample is scored on seven criteria: editing direction, scene selection clarity, composition specificity, editing detail, content grounding, feasibility, and sample-specific guidance.
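One plausible way to aggregate such per-criterion scores into total and average gains is sketched below. The aggregation rule (sum the criteria per sample, then compare sums and means across samples), the snake_case criterion keys, and the function names are assumptions made for illustration; the page does not specify the scoring scale or the exact formula, and the scores in the usage example are made up.

```python
from statistics import mean

# Criterion names taken from the text; the snake_case keys are an assumption.
CRITERIA = [
    "editing_direction",
    "scene_selection_clarity",
    "composition_specificity",
    "editing_detail",
    "content_grounding",
    "feasibility",
    "sample_specific_guidance",
]


def sample_total(scores: dict) -> float:
    """Sum one sample's scores over all criteria."""
    return sum(scores[c] for c in CRITERIA)


def compare(baseline: list, proposed: list) -> dict:
    """Compare two systems scored on the same samples (assumed aggregation)."""
    base_totals = [sample_total(s) for s in baseline]
    prop_totals = [sample_total(s) for s in proposed]
    return {
        # Difference of summed totals over all samples.
        "total_gain": sum(prop_totals) - sum(base_totals),
        # Difference of mean per-sample totals.
        "average_gain": mean(prop_totals) - mean(base_totals),
        # Mean gain on each individual criterion.
        "per_criterion_gain": {
            c: mean(s[c] for s in proposed) - mean(s[c] for s in baseline)
            for c in CRITERIA
        },
    }


if __name__ == "__main__":
    # Made-up scores for two samples, for demonstration only.
    baseline = [{c: 3 for c in CRITERIA} for _ in range(2)]
    proposed = [{c: 4 for c in CRITERIA} for _ in range(2)]
    print(compare(baseline, proposed))
```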
Average Score by Evaluation Criterion
Average Total Score by Subcategory
Citation
@misc{lee2026kosum,
title = {Beyond Highlight Detection: Most-Replayed Driven Multimodal Analysis of Korean YouTube Videos for Highlight Editing Guidance},
author = {Lee, Chanhee and Jang, Jinho and Ha, Sungjun and Jung, Jinwoong},
year = {2026},
note = {2026 Spring DSC3028 Capstone Project Technical Report, Sungkyunkwan University}
}
