Jungang Li1 Sicheng Tao1 Yibo Yan1,2
Xiaojie Gu1 Haodong Xu1 Xu Zheng1,2 Yuanhuiyi Lyu1,2
Linfeng Zhang3 Xuming Hu1,2
1The Hong Kong University of Science and Technology (Guangzhou)
2The Hong Kong University of Science and Technology 3Shanghai Jiao Tong University
Equal contribution. Corresponding author.
Abstract
Endeavors have been made to explore Large Language Models for video analysis (Video-LLMs), particularly in understanding and interpreting long videos. However, existing Video-LLMs still struggle to effectively integrate the rich and diverse audio-visual information inherent in long videos, which is crucial for comprehensive understanding. This raises the question: how can we leverage embedded audio-visual information to enhance long video understanding? To this end, (i) we introduce SAVEn-Vid, the first-ever long audio-visual video dataset comprising over 58k audio-visual instructions. (ii) From the model perspective, we propose a time-aware Audio-Visual Large Language Model (AV-LLM), SAVEnVideo, fine-tuned on SAVEn-Vid. (iii) Furthermore, we present AVBench, a benchmark containing 2,500 QAs designed to evaluate models on enhanced audio-visual comprehension tasks within long videos, challenging their ability to handle intricate audio-visual interactions. Experiments on AVBench reveal the limitations of current AV-LLMs. Experiments also demonstrate that SAVEnVideo outperforms the best Video-LLM by 3.61% on the zero-shot long video task (Video-MME) and surpasses the leading audio-visual LLM by 1.29% on the zero-shot audio-visual task (Music-AVQA). Consequently, SAVEnVideo achieves state-of-the-art performance at the 7B parameter scale. Our dataset and code will be released at URL upon acceptance.
1 Introduction
Humans generally perceive the world through vision and hearing, responding to and processing complex external events. Large Language Models (LLMs)[51, 49, 1, 14, 64, 2] are demonstrating increasingly strong capabilities on general tasks [72, 26, 75, 76]. Multimodal Large Language Models (MLLMs)[42, 35, 44, 65], built through techniques such as modality alignment and visual instruction fine-tuning under progressive training strategies, also exhibit significant potential[87, 34, 31, 30, 74]. In addition, recent advances in Video Large Language Models (Video-LLMs)[39, 7, 66, 54] indicate a shift from short video understanding tasks toward the intricate challenges of long-term video understanding[48, 20, 69, 5]. However, existing methods face limitations in handling long-video understanding tasks, particularly with regard to the audio modality. Most current models[35, 39, 86, 65] lack the capability to process audio inputs, focusing instead on compressing and retaining essential features of visual information in videos to enhance the model’s capacity for complex visual event comprehension. Among the few long-video LLMs that incorporate audio[21, 80, 40], the audio modality is often treated merely as user query instructions rather than encompassing a broader spectrum of audio-visual events — where the focus remains largely on human speech rather than general sounds[11, 68, 4, 79, 33, 17].
The audio-visual content in video data reflects the rich dynamic and multimodal information of the real world[17, 33]. Capturing the complex integration of auditory and visual cues in long videos allows Video-LLMs to comprehensively interpret the intricate and intertwined information across events in a video, which is particularly important for developing a fully multimodal video model capable of understanding the complex realities of the world[47, 45, 70, 60]. Some work has explored how to enable models to understand audio-visual information and perform audio-visual tasks driven by natural language on short videos: AVicuna[61] introduced PU-VALOR, an audio-visual dataset with temporal annotations that, after model fine-tuning, aligns audio-visual events with time intervals and corresponding text labels; VideoLLaMA2[12] enhanced its audio-visual expression capabilities by jointly training an additional audio branch. However, these works still have limited capacity to handle audio-visual information in long videos. To develop a long Video-LLM with comprehensive audio-visual capability, three key challenges need to be addressed, as shown in Figure 1:
- •
Existing long video datasets have detailed visual information annotations but lack detailed audio-visual event captions and accurate timestamp annotations, making it difficult to understand complex audio-visual events in long videos (i.e., longer than 1 minute) (Figure 1 (a));
- •
Existing benchmarks focusing on audio-visual capability are mainly composed of short videos (i.e., less than 1 minute), so current models can answer questions relying on single-modal processing alone, sidestepping the need to understand audio-visual information simultaneously (Figure 1 (b));
- •
There exists an urgent need in the MLLM community to enable comprehensive understanding of audio-visual information within long video contexts.
To address these challenges, we improve the existing model structure and synthesize new datasets for this domain based on existing data. Specifically, we propose SAVEnVideo, a new AV-LLM that better focuses on the audio-visual features in videos. It aligns audio and visual modalities in both space and time and implements an audio-visual compression projection layer that effectively compresses feature tokens, allowing the model to handle longer contexts while fully aligning audio-visual features, ultimately achieving strong audio-visual joint understanding on long videos. At the same time, we develop a dataset pipeline to extract visual and audio data from videos. Leveraging state-of-the-art open-source models, we generate high-quality captions for both video and audio content. These captions are then integrated into joint audio-visual descriptions using Qwen2.5-72B[62], alongside well-designed prompts to create Q&A pairs for audio-visual queries. This process results in SAVEn-Vid, a large-scale dataset featuring detailed timestamp annotations and comprehensive audio-visual captions, comprising 21k AV captions and 37k Q&A pairs. To evaluate the advanced audio-visual understanding capabilities of Video-LLMs, we further propose a benchmark called AVBench, which was manually inspected and requires models to understand both the audio-visual queries in questions and advanced audio-visual events in videos simultaneously.
In summary, we contribute in the following aspects:
- •
We propose SAVEnVideo, a Video-LLM that aligns audio-visual features in both spatial and temporal dimensions, with a compression layer for efficient long-context handling, achieving superior audio-visual understanding on long videos.
- •
We develop SAVEn-Vid, comprising extensive timestamped audio-visual captions and Q&A pairs. It leverages high-quality captions and merges them with prompt-engineered audio-visual queries to support nuanced model training.
- •
To assess advanced audio-visual comprehension, we present AVBench, a curated evaluation set that challenges models on complex audio-visual queries and interactions within long-video contexts.
2 Related Work
2.1 Audio-Visual LLMs
Recent advancements in multimodal information processing have seen an increasing trend towards integrating audio modality to complement or combine with visual information, thereby enriching the representation and understanding of multimedia content [47, 45, 70, 60, 41, 56, 89]. Early explorations, such as those conducted with VideoChat[39] utilizing Whisper[52] as the audio encoder, have laid the groundwork for leveraging speech information effectively. PandaGPT[57] integrates ImageBind[24] and Vicuna[13], employing image-text pairs for training to enhance cross-modal understanding. The VAST[9] model further advances this field by being trained on the VAST-27M[9] dataset, an omni-modality video caption dataset, to boost its multimodal capabilities. CAT[81] introduces a clue aggregator designed to gather relevant cues associated with a question, enabling the model to understand and respond to specific audio-visual events more accurately. Video-Salmonn[58] proposes a multi-resolution causal Q-former architecture that efficiently bridges pre-trained audio-visual encoders with the main LLM while maintaining the capability to process other video elements. Additionally, models like Baichuan Omni[40], Video-LLaMA[84], and Video-LLaMA2[12] have undergone fine-tuning specifically for the audio modality to strengthen their comprehension of multimodal content. In addition, models such as NExT-GPT[71], OneLLM[27], and VITA[21] have adopted the concept of an ‘omni-encoder’ to uniformly process different data modalities, including text, images, videos, and audio. These developments collectively represent significant strides towards achieving more sophisticated and nuanced multimodal interaction and understanding.
Table 1: Comparison of video and audio-visual datasets.

| Dataset | Train Set | Test Set | MS | TA | VD | AD | AVD | VQA | AVQA | Annotated Content | Avg. Duration (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Video Dataset | | | | | | | | | | | |
| Sharegpt4video[7] | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | video | 26 |
| Cinepile[53] | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | video, subtitle | 160 |
| NExT-QA[73] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | video | 48 |
| Video-MME[19] | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | video, subtitle | 1020 |
| LongVideoBench[69] | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | video, subtitle | 480 |
| MovieChat[55] | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | video | 420 |
| Audio-Visual Dataset | | | | | | | | | | | |
| UnAV-100[23] | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | audio, video | 42 |
| VAST-27M[9] | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | audio, video | 20 |
| AVQA[78] | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | audio, video | 60 |
| AVInstruct[81] | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | audio, video | 115 |
| Music-AVQA[37] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | audio, video | 10 |
| VGGSound[6] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | audio, video | 10 |
| AVBench+SAVEn-Vid (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | audio, video | 182 |
2.2 Multimodal Video Datasets
Early audio-visual datasets such as AudioSet[22], VGG-Sound[6], and AVE[63] primarily focused on coarse-grained labels with limited temporal coverage and simplistic tag information. More recent datasets like Music-AVQA[36], AVQA[78], and AVInstruct[81] introduced QA pairs to facilitate instruction tuning for audio-visual data, yet our analysis reveals that these designs often fail to fully exploit the potential of multimodal data, as questions can be answered using information from a single modality. UnAV-100[23] made a significant step by providing dense annotations for untrimmed videos; however, the simplicity of its label information restricts the full utilization of multimodal large language models. The VAST-27M[9] dataset leverages large language models for automated labeling, achieving a comprehensive integration of audio-visual information, but it falls short in handling long-duration videos and lacks instructional data.
Given the more granular requirements of advanced audio-visual tasks, existing datasets exhibit notable limitations in terms of video length, annotation detail, and the depth of multimodal information extraction. To address these challenges, we introduce the SAVEn-Vid dataset, a resource specifically designed for long-form audio-visual comprehension tasks. In contrast to existing audio-visual datasets, SAVEn-Vid — constructed using state-of-the-art models for automated generation — aims to provide a richer, more detailed, and temporally extended dataset that supports advanced research and applications in multimodal understanding and long audio-visual video understanding.
3 SAVEn-Vid and AVBench
To tackle the challenges mentioned in the introduction, we provide SAVEn-Vid for training and AVBench for testing. SAVEn-Vid and AVBench are constructed with the same pipeline but from different data sources. In addition, AVBench undergoes an extra quality-control step; more details can be found in the Appendix.
Fine-grained descriptions are one of the key features of our dataset. We argue that long-form videos, rich in semantics and events, require more detailed annotations. However, annotating long videos presents significant challenges. To address this, we adopt a segment-based approach. Based on a visual event boundary detection technique, each video in the dataset is segmented into multiple parts. Each segment is annotated with timestamps and corresponding visual, audio, and audio-visual descriptions, forming a dense description. Leveraging these three types of descriptions and the timestamps, each video is associated with multiple-choice question answering (MC-QA) tasks. Each QA requires the use of both visual and auditory modalities to answer, guiding or testing the model’s multimodal capabilities thoroughly.
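For concreteness, the sketch below shows what a single annotated segment could look like; the field names are illustrative assumptions rather than the released schema.

```python
# Illustrative (hypothetical) record for one annotated segment in SAVEn-Vid.
# Field names are assumptions for explanation only, not the released schema.
segment_record = {
    "video_id": "example_0001",
    "segment": {"start_s": 32.4, "end_s": 61.7},              # timestamps in seconds
    "visual_caption": "A man walks onto a stage ...",          # objective visual description
    "audio_caption": "Applause, then a calm male voice ...",   # audio characteristics, not ASR
    "audio_visual_caption": "As applause fades, the speaker begins to ...",
    "mc_qa": [
        {
            "question": "What sound accompanies the speaker's entrance between 32s and 40s?",
            "options": ["A. Applause", "B. Rain", "C. A siren", "D. Silence"],
            "answer": "A",
        }
    ],
}
```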
We leverage state-of-the-art open-source MLLMs to automatically generate detailed video descriptions along with MC-QA. This approach allows for the large-scale production of high-quality textual annotations for video content without manual labeling of all entries, thereby enhancing pipeline efficiency and enriching the dataset’s depth and breadth across tasks and annotations. Table 1 compares our dataset with existing ones in terms of modality information, annotation content, and video duration.
3.1 Data Collection
Data diversity is a key consideration in our approach because the synergy of vision and sound in real life is extensive and diverse. To achieve this, we hierarchically incorporate a variety of video sources of different lengths, enriching the diversity of scenes and themes by leveraging other high-quality video datasets. The compatibility of the data with audio-visual tasks is also crucial; given our focus on long audio-visual tasks, we aim to include video sources that are rich in audio-visual information and have longer durations. After a comprehensive survey, we select Video-MME[20] (for its longer duration), FineVideo[18] (for its rich speech and audio information), MovieChat[55] (for its long videos and diverse audio contents), NExT-QA[73] (for its high suitability for video tasks), and VGGSound[6] (for its varied audio classifications). We use the original splits of these data sources for the training and test sets. These sources provide a wide range of video data from different websites, perspectives, and domains, and form the basis on which we further explore the potential of both visual and audio information.
3.2 Construction Pipeline
Our dataset construction pipeline, as illustrated in Figure 2, follows a structured approach to handle the complexities of processing long videos.
Video Clip Segmentation. We start by segmenting long videos into shorter clips to make the data more manageable, as shown in Figure 2 (a). For this task, we employ AutoShot[88], a boundary-detection-based video segmentation model that divides each long video into distinct clips, serving as the foundation for all subsequent processing. These segmented clips may still retain a substantial length compared to traditional short videos, allowing us to capture richer context within each segment.
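The snippet below sketches how detected shot boundaries can be turned into clip spans; it assumes AutoShot has already returned a list of boundary frame indices, and the helper shown is hypothetical rather than AutoShot’s actual API.

```python
# Sketch: turn detected shot boundaries into clip spans. `boundaries` is assumed to
# be the list of boundary frame indices returned by a shot-boundary detector such as
# AutoShot; the helper itself is hypothetical, not AutoShot's API.
def split_into_clips(num_frames: int, boundaries: list[int], fps: float) -> list[tuple[float, float]]:
    """Convert boundary frame indices into (start_sec, end_sec) clip spans."""
    cut_points = [0] + sorted(boundaries) + [num_frames]
    clips = []
    for start, end in zip(cut_points[:-1], cut_points[1:]):
        if end > start:
            clips.append((start / fps, end / fps))
    return clips

# Example: a 9000-frame video at 25 fps with boundaries detected at frames 1200 and 5400.
print(split_into_clips(9000, [1200, 5400], fps=25.0))
# [(0.0, 48.0), (48.0, 216.0), (216.0, 360.0)]
```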
Visual Captioning. Next, we apply visual captioning to these segmented clips using Qwen2VL[3, 65], an advanced open-source video model, to generate objective visual descriptions (Figure 2 (b)). This phase focuses on providing a clear, objective description of the visual content, capturing essential elements without subjective interpretation.
Audio Captioning. Simultaneously, we perform audio captioning for each video clip using Qwen2Audio[15, 16], as depicted in Figure 2 (c). Unlike standard ASR (Automatic Speech Recognition), this step emphasizes describing the audio’s characteristics, such as the timbre, environment, events, and emotional tone, rather than transcribing spoken language. This approach enables us to capture nuanced audio information relevant to the video’s context.
Audio-Visual Description Fusion. Once we have both visual and audio descriptions, we merge them in the audio-visual description fusion phase, as shown in Figure 2 (d). Utilizing the multimodal capabilities of Qwen2.5[77, 62], we integrate the visual and audio captions to create comprehensive descriptions for each video segment. This fusion provides a more holistic view of the scene by combining auditory and visual elements, with synchronized timestamps to support fine-grained analysis.
Audio-Visual Question-Answer Pair Generation. In the final phase as illustrated in Figure 2 (e), we generate question-answer pairs based on the audio-visual captions. This process involves consistency and redundancy checks to ensure question quality. By leveraging Qwen2.5[77, 62], we formulate questions that draw on temporal, visual, and audio information, testing the model’s ability to understand complex multimodal interactions. Additionally, two distinct formats of temporal questions are introduced: (1) questions with explicit temporal segments, requiring answers based on specific time intervals, and (2) questions where timestamps are presented as separate options, encouraging the model to deduce the answer from context. This method enhances the depth and robustness of our question set, promoting advanced reasoning over multimodal information.
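The sketch below illustrates the fusion and QA-generation steps under simplified assumptions: `chat` stands in for a call to the LLM (Qwen2.5 in our pipeline), and the prompts are abbreviated placeholders rather than the actual prompts listed in the Appendix.

```python
# Simplified sketch of Figure 2 (d)-(e). `chat` is a placeholder for a call to the
# fusion/QA LLM (Qwen2.5 in our pipeline); the prompts are abbreviated stand-ins for
# the actual prompts listed in the Appendix.
def fuse_and_generate_qa(visual_caption: str, audio_caption: str,
                         start: float, end: float, chat) -> dict:
    fusion_prompt = (
        f"Merge the visual and audio descriptions of a clip ({start:.1f}s-{end:.1f}s) "
        f"into one audio-visual description.\nVisual: {visual_caption}\nAudio: {audio_caption}"
    )
    av_caption = chat(fusion_prompt)

    qa_prompt = (
        "Write one multiple-choice question (four options, with the answer key) that can "
        "only be answered by combining the audio and visual information below.\n"
        f"Description ({start:.1f}s-{end:.1f}s): {av_caption}"
    )
    mc_qa = chat(qa_prompt)
    return {"av_caption": av_caption, "mc_qa": mc_qa, "span": (start, end)}
```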
4 SAVEnVideo
Figure 3 presents the architecture of SAVEnVideo. It is composed of a frozen visual encoder, a frozen audio encoder, a frozen textual encoder, a Temporal Linear Projection (TLP), an Audio-Visual Time-Spatial Resampler (AVTS), and an LLM enhanced with LoRA. By utilizing temporal signals through the TLP, we are able to conduct event-level modeling in scenarios involving long contexts. Additionally, the AVTS facilitates modality fusion effectively, thus alleviating the burden on the LLM.
4.1 Model Architecture
4.1.1 Multimodal Encoders
We use frozen encoders for feature extraction to ensure that our model can extract the necessary feature information from multimodal input. Vision Encoder. Following LLaVA-OneVision[35], we adopt SigLIP[83] as the Vision Encoder, which provides a rich representation of the input images and video frames, contributing to the model’s visual understanding. Audio Encoder. For audio encoding, we adopt BEATs[8] as the Audio Encoder, which effectively captures intricate audio patterns and generates embeddings encoding essential temporal and spectral information. Integrating BEATs into SAVEnVideo enhances the encoder’s ability to capture temporal dynamics, enabling a seamless fusion of audio-visual features.
4.1.2 Temporal Linear Projection
To enable the model to coordinate audio-visual signals in long video tasks and achieve advanced audio-visual understanding, we introduce temporal signals as the pivot for audio-visual representation in SAVEnVideo. Unlike VTimeLLM[29], which treats the time signal as a text embedding, we input the time signal as a separate modality and expect it to serve as a trigger that enhances audio-visual understanding, guiding the model to coordinate audio-visual signals to complete tasks. We accomplish this through the Temporal Linear Projection (TLP). Specifically, we extract task-related temporal information from the query in context and model the corresponding timestamps. For a given video sequence with timestamps $\{t_1, t_2, \dots, t_N\}$, we define a feature vector for each timestamp as $\mathbf{x}_i \in \mathbb{R}^{d}$, where $i = 1, \dots, N$. To capture temporal dependencies, we apply a linear transformation to each timestamp feature:
$\mathbf{h}_i = \mathbf{W}\,\mathbf{x}_i + \mathbf{b}$ (1)
where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector, respectively, optimized to better adapt to the temporal dynamics of the task. Subsequently, we integrate these temporal features into the AVTS Resampler (see Section 4.1.3), enabling joint modeling of temporal information and content features to more accurately capture the temporal dependencies within the video content.
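A minimal PyTorch sketch of the TLP is given below; the input and output dimensions are illustrative assumptions (e.g., projecting normalized (start, end) spans into an assumed LLM embedding width), not the exact configuration used in SAVEnVideo.

```python
import torch
import torch.nn as nn

class TemporalLinearProjection(nn.Module):
    """Sketch of the TLP: the learned linear map of Eq. (1) applied to per-timestamp
    features before they enter the AVTS Resampler."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)  # h_i = W x_i + b

    def forward(self, timestamp_feats: torch.Tensor) -> torch.Tensor:
        # timestamp_feats: (num_events, in_dim), e.g. normalized (start, end) times
        return self.proj(timestamp_feats)

# Example: project two (start, end) spans, normalized by the video duration,
# into a 3584-dimensional space (an assumed embedding width).
tlp = TemporalLinearProjection(in_dim=2, out_dim=3584)
spans = torch.tensor([[0.10, 0.25], [0.60, 0.80]])
print(tlp(spans).shape)  # torch.Size([2, 3584])
```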
Table 2: Comparison on the audio-visual task (Music-AVQA) and the long-video task (Video-MME).

| Models | #Parameters | #Frames | Audio | Video | Music-AVQA | Video-MME |
|---|---|---|---|---|---|---|
| Closed Source | | | | | | |
| GPT4-o[50] | - | 1fps | ✗ | ✓ | - | 71.00 |
| Gemini 1.5 Pro[25] | - | 1fps | ✓ | ✓ | - | 75.00 |
| Open Source Video Model | | | | | | |
| LLaVA-NeXT-Video[86] | 7B | 32 | ✗ | ✓ | - | 46.50 |
| VideoLLaMA2[12] | 7B | 32 | ✓ | ✓ | - | 46.60 |
| LongVA[85] | 7B | 128 | ✗ | ✓ | - | 52.60 |
| OneLLM[27] | 7B | 15 | ✓ | ✓ | 47.60 | - |
| NExT-GPT[71] | 7B | 24 | ✓ | ✓ | 79.84 | 42.64 |
| CREMA[82] | 4B | 4 | ✓ | ✓ | 75.60 | - |
| PandaGPT[57] | 7B | 10 | ✓ | ✓ | 81.85 | 43.45 |
| SAVEnVideo (w.o. SAVEn-Vid) | 7B | 16 | ✓ | ✓ | 74.80 | 53.60 |
| SAVEnVideo | 7B | 16 | ✓ | ✓ | 83.14 | 56.21 |
4.1.3 Audio-Visual Time-Spatial Resampler
Both the visual features and the audio signals in video tasks carry a large amount of redundant information. Therefore, after encoding visual and audio features with the visual and audio encoders and generating temporal embeddings via the TLP, we construct the Audio-Visual Time-Spatial Resampler (AVTS Resampler), inspired by Q-Former[38]. The AVTS Resampler is composed of two MLPs that separately process visual embeddings and audio embeddings, along with a non-causal transformer decoder. This decoder leverages a set of learnable weights as initial queries, naturally compressing the length of the audio-visual features.
We randomly initialize the attention layers in the AVTS Resampler and incorporate both sinusoidal position encodings and learnable positional embeddings for the MLP outputs and queries in each cross-attention layer. Given the input feature sequence $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_M\}$ with $M$ elements, we define the initial query embeddings as $\mathbf{Q} = \{\mathbf{q}_1, \dots, \mathbf{q}_K\}$, where $K \ll M$. Each cross-attention layer computes the output embeddings as follows:
$\mathbf{Z} = \mathbf{A}\,(\mathbf{X}\mathbf{W}_V)$ (2)
where $\mathbf{A}$ represents the attention weights, defined by:
$\mathbf{A} = \mathrm{softmax}\!\left(\dfrac{(\mathbf{Q}\mathbf{W}_Q)(\mathbf{X}\mathbf{W}_K)^{\top}}{\sqrt{d}}\right)$ (3)
This process allows the AVTS Resampler to aggregate relevant audio-visual information effectively, significantly reducing redundancy while retaining key temporal-spatial features crucial for downstream tasks.
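The sketch below illustrates the core query-based compression idea with standard PyTorch modules; the number of queries, layers, and dimensions are illustrative assumptions, and details such as the sinusoidal and learnable positional encodings are omitted.

```python
import torch
import torch.nn as nn

class AVTSResampler(nn.Module):
    """Sketch of query-based compression: learnable queries cross-attend to the
    concatenated visual, audio, and temporal embeddings and return a fixed number
    of tokens. Layer count and dimensions are illustrative; positional encodings
    and the full decoder blocks are omitted."""

    def __init__(self, dim: int = 1024, num_queries: int = 64,
                 num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.visual_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, visual: torch.Tensor, audio: torch.Tensor,
                temporal: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, D), audio: (B, Na, D), temporal: (B, Nt, D)
        kv = torch.cat([self.visual_mlp(visual), self.audio_mlp(audio), temporal], dim=1)
        queries = self.queries.unsqueeze(0).expand(visual.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(queries, kv, kv)  # Eqs. (2)-(3): softmax(QK^T / sqrt(d)) V
            queries = queries + out         # residual update of the queries
        return queries  # (B, num_queries, D): compressed audio-visual tokens

# Example with random features: 256 visual, 128 audio, 4 temporal tokens -> 64 tokens.
resampler = AVTSResampler()
z = resampler(torch.randn(2, 256, 1024), torch.randn(2, 128, 1024), torch.randn(2, 4, 1024))
print(z.shape)  # torch.Size([2, 64, 1024])
```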
4.2 Training
To align the multimodal embeddings with the input token space of the LLM, we perform multimodal alignment following VideoLLaMA[84]. We then use our carefully constructed SAVEn-Vid dataset for LoRA-based[28] instruction fine-tuning. More details about training can be found in the appendix.
4.2.1 Multimodal Alignment
In multimodal alignment, we first freeze the Vision Encoder, Audio Encoder, and LLM, and only update the MLP that processes visual embeddings, the TLP, and the attention layers in the AVTS Resampler, using LCS-558K[43] and a training split of Panda-70M[10]. Next, we unfreeze the MLP that processes audio embeddings in the AVTS Resampler, optimizing with AudioSet[22], AudioCaps[32], and Auto-ACD[59]. In the multimodal alignment stage, since our training data includes both images and short videos, we explicitly incorporate the full-length time information from the video and audio as input, and the TLP is also optimized at this stage. This ensures that the temporal relationships in both video and audio can be coarsely modeled during multimodal alignment. Our overall idea is to treat the temporal signal as a trigger for the model to understand the audio-visual information, enabling the model to fully leverage the audio-visual content in long videos that contain a large amount of redundant information.
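A schematic of this stage-wise freezing schedule is shown below; the attribute names refer to a hypothetical SAVEnVideo implementation and are for illustration only.

```python
# Sketch of the stage-wise freezing schedule described above. The attribute names
# (vision_encoder, audio_encoder, llm, avts, tlp) refer to a hypothetical
# SAVEnVideo implementation and are for illustration only.
def set_alignment_stage(model, stage: str) -> None:
    # Encoders and the LLM stay frozen throughout multimodal alignment.
    for module in (model.vision_encoder, model.audio_encoder, model.llm):
        for p in module.parameters():
            p.requires_grad = False

    if stage == "visual_alignment":      # LCS-558K + Panda-70M split
        trainable = [model.avts.visual_mlp, model.avts.layers, model.tlp]
    elif stage == "audio_alignment":     # AudioSet + AudioCaps + Auto-ACD
        trainable = [model.avts.audio_mlp]
    else:
        raise ValueError(f"unknown stage: {stage}")

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
```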
4.2.2 Audio-Visual Temporal Fine-Tuning
In this stage, we utilize three datasets — VideoInstruct100K[46], InternVid[67], and VGGSound[6] — to train the LLM with LoRA, enhancing the model’s capacity to process and align audio-visual temporal data. Finally, we further fine-tune SAVEnVideo on our custom-designed SAVEn-Vid. This fine-tuning stage allows SAVEnVideo to adapt its learned representations to comprehend high-level audio-visual information from long videos with dense audio-visual events.
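As a reference point, the snippet below shows one way to attach LoRA adapters to a Qwen2 backbone with the PEFT library; the rank, scaling, dropout, and target modules are illustrative choices, not the values used in our training.

```python
# One way to attach LoRA adapters to a Qwen2 backbone with the PEFT library.
# The rank, alpha, dropout, and target modules below are illustrative choices,
# not the values used in our training.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the adapter weights remain trainable
```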
5 Experiments
Implementation Details. For the Vision Encoder and Audio Encoder, we utilize SigLIP[83] and BEATs[8], respectively, while employing Qwen2[77] as the LLM. During modality alignment, our primary focus is training the newly initialized AVTS Resampler. Empirically, we set the global batch size and learning rate to 512 and 2e-3 for modality alignment, and 1,024 and 2e-4 for Audio-Visual Temporal Fine-Tuning. We pre-train SAVEnVideo for 1 epoch and fine-tune the pre-trained model for up to 2 epochs. When processing videos, SAVEnVideo uniformly samples 16 frames.
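The uniform frame sampling can be implemented as below; this is one common convention for evenly spaced indices, assumed here rather than taken from our code.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    """Evenly spaced frame indices across the whole video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int).tolist()

# Example: a 3-minute clip at 25 fps (4500 frames) -> 16 indices.
print(uniform_frame_indices(4500))
```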
5.1 Benchmarks and Metrics
We use Video-MME[20] and Music-AVQA[37] to evaluate our model. Video-MME is a benchmark for assessing long video understanding capabilities; its videos are categorized by duration and include a long-video subset ranging from 30 minutes to 1 hour. MUSIC-AVQA is an open VideoQA dataset designed to assess comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenarios; we use its test set of 9,185 QA pairs. We perform standardized evaluation using greedy decoding (num_beams=1) and benchmark our results against other open-source and proprietary models. The models we test fall into two categories: one consists of models that only accept video frames, such as LLaVA-NExT-Video[86], which may accept more frames but cannot handle the audio modality; the other includes models (AV-LLMs) like PandaGPT[57] and NExT-GPT[71] that accept both video and audio simultaneously, but whose ability to process long video information is often limited. Due to the constraints of their training datasets, these models tend to perform poorly on videos featuring complex audio-visual events and are unable to respond to enhanced audio-visual questions.
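The evaluation protocol can be summarized by the sketch below: greedy decoding (num_beams=1 with sampling disabled, shown with Hugging Face-style generation arguments) and exact matching of the predicted option letter; the scoring helper is an illustrative assumption, not the official evaluation script.

```python
import re

# Greedy decoding configuration (Hugging Face-style generation arguments) and a
# simple exact-match scorer for the predicted option letter (illustrative only).
GEN_KWARGS = dict(do_sample=False, num_beams=1, max_new_tokens=32)

def score_mc_answer(prediction: str, answer: str) -> bool:
    """True if the first standalone option letter (A-D) in the prediction matches the key."""
    match = re.search(r"\b([A-D])\b", prediction.upper())
    return bool(match) and match.group(1) == answer.upper()

print(score_mc_answer("The answer is B.", "B"))  # True
```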
5.2 Comparison Experiments
Table 2 presents our experimental results on two video understanding benchmarks. Our model demonstrates excellent performance among Video-LLMs with 7B parameters. On Video-MME, although LongVA samples more frames (128) in the visual encoding stage than our model (16), and both LongVA and our model use Qwen2[77] as the base language model, SAVEnVideo ultimately achieves an accuracy 3.61% higher than LongVA. We attribute this to the sound-related scenarios in Video-MME: with a limited number of frames, models are likely to miss key information, but in scenarios with high audio-visual consistency, the missed information can be compensated by the audio input. Our model, thanks to its additional processing of audio information, therefore achieves better performance on Video-MME. This supports our viewpoint: if a model can effectively model audio-visual information, it helps the model understand complex audio-visual events and ultimately respond successfully to audio-visual-related questions. Furthermore, although LongVA extends the context length of the language model, its training did not incorporate long video data, which likely contributed to our surpassing its performance. This also highlights the contribution of SAVEn-Vid to the field.
In addition, our model achieves an accuracy of 83.14% on Music-AVQA[37], which is also excellent among AV-LLMs with the same number of parameters. Since most existing long Video-LLMs do not support audio input, the audio-visual models we compare against were mostly trained on short video datasets without rich audio-visual information. This may be the reason we ultimately surpass PandaGPT and achieve state-of-the-art results on both the Video-MME and Music-AVQA benchmarks. It also indicates that our model can analyze inputs rich in audio-visual content and accurately answer questions related to that content.
5.3 Results on AVBench
We compare our proposed SAVEnVideo with existing models on our proposed AVBench, as shown in Table 3. Among open-source models with 7B parameters, SAVEnVideo achieves a significant performance improvement, attaining an AVBench score of 66.71%. This outperforms NExT-GPT and PandaGPT, despite all models supporting both audio and video modalities and processing a similar number of frames. The results show the difficulty and importance of our enhanced audio-visual questions, indicating that current AV-LLMs cannot handle long and complex audio-visual scenarios. These results highlight the effectiveness of our approach in leveraging audio-visual information for advanced comprehension tasks. SAVEnVideo not only surpasses other open-source models but also approaches the performance of closed-source counterparts, emphasizing the benefits of our architecture and training strategies in handling complex audio-visual interactions within long-video contexts.
Table 3: Results on AVBench.

| Models | #Parameters | #Frames | Audio | Video | AVBench |
|---|---|---|---|---|---|
| Closed Source | | | | | |
| GPT4-o | - | 1fps | ✗ | ✓ | 77.29 |
| Gemini 1.5 Pro | - | 1fps | ✓ | ✓ | 75.71 |
| Open Source Video Model | | | | | |
| NExT-GPT | 7B | 24 | ✓ | ✓ | 34.04 |
| PandaGPT | 7B | 10 | ✓ | ✓ | 32.26 |
| SAVEnVideo (w.o. SAVEn-Vid) | 7B | 16 | ✓ | ✓ | 58.49 |
| SAVEnVideo | 7B | 16 | ✓ | ✓ | 66.71 |
5.4 Ablation Study
Table 4: Ablation study of SAVEnVideo.

| Methods | Video-MME | Music-AVQA | AVBench |
|---|---|---|---|
| w.o. audio input | 48.70 | 52.36 | 56.21 |
| w.o. SAVEn-Vid | 53.63 | 74.80 | 58.49 |
| w.o. AVTS Resampler (MLP) | 39.89 | 70.12 | 49.70 |
| SAVEnVideo | 56.21 | 83.14 | 66.70 |
To evaluate the contribution of each component in our proposed SAVEnVideo, we conducted an ablation study, the results of which are summarized in Table 4. We systematically removed key modules and assessed their impact on performance across three benchmarks: Video-MME, Music-AVQA, and AVBench. Removing the audio input (w/o audio input) resulted in a significant drop in performance across all benchmarks, with Video-MME accuracy dropping to 48.70%, Music-AVQA to 52.36%, and AVBench to 56.21%. This highlights the crucial role of audio features in enhancing the model’s understanding of audio-visual content. Excluding the SAVEn-Vid dataset from training led to a moderate decline, with Video-MME accuracy at 53.63%, Music-AVQA at 74.80%, and AVBench at 58.49%. This indicates that SAVEn-Vid makes a significant contribution to the effective fusion of audio and visual modalities, particularly benefiting tasks that require complex audio-visual synchronization. The results of replacing the AVTS Resampler with a simple MLP (w/o AVTS Resampler (MLP)) show that this module contributes significantly to the overall effectiveness of the model, demonstrating the fundamental role of the AVTS Resampler in aligning and compressing audio-visual features over time, which is crucial for understanding the context of long videos.
6 Conclusion
This paper introduces SAVEn-Vid, the first audio-visual dataset for long video understanding, aimed at enhancing the understanding of long video contexts through integrated audio-visual information. We developed SAVEnVideo, which explicitly models the temporal modality: temporal signals serve as triggers for the model to comprehend audio-visual features, enabling alignment of these features across both spatial and temporal dimensions. Moreover, we proposed AVBench for evaluating audio-visual understanding in long videos. SAVEnVideo outperforms the leading models on Video-MME, Music-AVQA, and AVBench. Our extensive analysis further guides the design of large language models for long videos toward more complex audio-visual understanding.
References
- Achiam etal. [2023]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
- Anthropic [2024]Anthropic.The claude 3 model family: Opus, sonnet, haiku.2024.
- Bai etal. [2023]Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023.
- Cappellazzo etal. [2024]Umberto Cappellazzo, Minsu Kim, Honglie Chen, Pingchuan Ma, Stavros Petridis, Daniele Falavigna, Alessio Brutti, and Maja Pantic.Large language models are strong audio-visual speech recognition learners.arXiv preprint arXiv:2409.12319, 2024.
- Chandrasegaran etal. [2024]Keshigeyan Chandrasegaran, Agrim Gupta, LeaM. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Fei-Fei Li.Hourvideo: 1-hour video-language understanding.In Advances in Neural Information Processing Systems, 2024.
- Chen etal. [2020]Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman.Vggsound: A large-scale audio-visual dataset.In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
- Chen etal. [2024a]Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, etal.Sharegpt4video: Improving video understanding and generation with better captions.arXiv preprint arXiv:2406.04325, 2024a.
- Chen etal. [2023]Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei.BEATs: Audio pre-training with acoustic tokenizers.In Proceedings of the 40th International Conference on Machine Learning, pages 5178–5193. PMLR, 2023.
- Chen etal. [2024b]Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu.Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset.Advances in Neural Information Processing Systems, 36, 2024b.
- Chen etal. [2024c]Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, ByungEun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, etal.Panda-70m: Captioning 70m videos with multiple cross-modality teachers.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024c.
- Chen etal. [2024d]Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, LuisFernando D’Haro, RobbyT Tan, and Haizhou Li.Beyond single-audio: Advancing multi-audio processing in audio large language models.arXiv preprint arXiv:2409.18680, 2024d.
- Cheng etal. [2024]Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing.Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024.
- Chiang etal. [2023a]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE. Gonzalez, Ion Stoica, and EricP. Xing.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023a.
- Chiang etal. [2023b]Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE. Gonzalez, Ion Stoica, and EricP. Xing.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023b.
- Chu etal. [2023]Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou.Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models.arXiv preprint arXiv:2311.07919, 2023.
- Chu etal. [2024]Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen2-audio technical report.arXiv preprint arXiv:2407.10759, 2024.
- Deshmukh etal. [2023]Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang.Pengi: An audio language model for audio tasks.Advances in Neural Information Processing Systems, 36:18090–18108, 2023.
- Farré etal. [2024]Miquel Farré, Andi Marafioti, Lewis Tunstall, Leandro VonWerra, and Thomas Wolf.Finevideo.https://huggingface.co/datasets/HuggingFaceFV/finevideo, 2024.
- Fu etal. [2023]Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, etal.Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023.
- Fu etal. [2024a]Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, etal.Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024a.
- Fu etal. [2024b]Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, etal.Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024b.
- Gemmeke etal. [2017]JortF. Gemmeke, Daniel P.W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R.Channing Moore, Manoj Plakal, and Marvin Ritter.Audio set: An ontology and human-labeled dataset for audio events.In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
- Geng etal. [2023]Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, and Feng Zheng.Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22942–22951, 2023.
- Girdhar etal. [2023]Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, KalyanVasudev Alwala, Armand Joulin, and Ishan Misra.Imagebind: One embedding space to bind them all.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023.
- Google [2024]GeminiTeam Google.Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024.
- Hadi etal. [2023]MuhammadUsman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, MuhammadBilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, etal.A survey on large language models: Applications, challenges, limitations, and practical usage.Authorea Preprints, 2023.
- Han etal. [2024]Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue.Onellm: One framework to align all modalities with language.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584–26595, 2024.
- Hu etal. [2022]EdwardJ Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.LoRA: Low-rank adaptation of large language models.In International Conference on Learning Representations, 2022.
- Huang etal. [2023]Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu.Vtimellm: Empower llm to grasp video moments.arXiv preprint arXiv:2311.18445, 2(3):9, 2023.
- Huang etal. [2024]Kaichen Huang, Jiahao Huo, Yibo Yan, Kun Wang, Yutao Yue, and Xuming Hu.Miner: Mining the underlying pattern of modality-specific neurons in multimodal large language models.arXiv preprint arXiv:2410.04819, 2024.
- Huo etal. [2024]Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, and Xuming Hu.Mmneuron: Discovering neuron-level domain-specific interpretation in multimodal large language model.arXiv preprint arXiv:2406.11193, 2024.
- Kim etal. [2019]ChrisDongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim.AudioCaps: Generating captions for audios in the wild.In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
- Latif etal. [2023]Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, etal.Sparks of large audio models: A survey and outlook.arXiv preprint arXiv:2308.12792, 2023.
- Li etal. [2024a]Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan.Seed-bench: Benchmarking multimodal large language models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024a.
- Li etal. [2024b]Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li.Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024b.
- Li etal. [2022a]Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, and Di Hu.Learning to answer questions in dynamic audio-visual scenarios.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19108–19118, 2022a.
- Li etal. [2022b]Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, and Di Hu.Learning to answer questions in dynamic audio-visual scenarios.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19108–19118, 2022b.
- Li etal. [2023]Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- Li etal. [2024c]KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao.Videochat: Chat-centric video understanding, 2024c.
- Li etal. [2024d]Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, and Weipeng Chen.Baichuan-omni technical report.arXiv preprint arXiv:2410.08565, 2024d.
- Liang etal. [2024]PaulPu Liang, Amir Zadeh, and Louis-Philippe Morency.Foundations & trends in multimodal machine learning: Principles, challenges, and open questions.ACM Computing Surveys, 56(10):1–42, 2024.
- Liu etal. [2023a]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning, 2023a.
- Liu etal. [2023b]Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.Advances in neural information processing systems, 36, 2023b.
- Liu etal. [2024a]Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and YongJae Lee.Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
- Liu etal. [2024b]Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, etal.Sora: A review on background, technology, limitations, and opportunities of large vision models.arXiv preprint arXiv:2402.17177, 2024b.
- Maaz etal. [2023]Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbaz Khan.Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023.
- Madan etal. [2024]Neelu Madan, Andreas Møgelmose, Rajat Modi, YogeshS Rawat, and ThomasB Moeslund.Foundation models for video understanding: A survey.arXiv preprint arXiv:2405.03770, 2024.
- Mangalam etal. [2023]Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik.Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023.
- OpenAI [2022]OpenAI.Introducing chatgpt.https://openai.com/blog/chatgpt, 2022.
- OpenAI [2024]OpenAI.Gpt-4o system card, 2024.
- Ouyang etal. [2022]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022.
- Radford etal. [2023]Alec Radford, JongWook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever.Robust speech recognition via large-scale weak supervision.In Proceedings of the 40th International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
- Rawal etal. [2024]Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein.Cinepile: A long video question answering dataset and benchmark, 2024.
- Shen etal. [2024]Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, etal.Longvu: Spatiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024.
- Song etal. [2023a]Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, etal.Moviechat: From dense token to sparse memory for long video understanding.arXiv preprint arXiv:2307.16449, 2023a.
- Song etal. [2023b]Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, and Weimin Zhang.How to bridge the gap between modalities: A comprehensive survey on multimodal large language model.arXiv preprint arXiv:2311.07594, 2023b.
- Su etal. [2023]Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai.Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023.
- Sun etal. [2024]Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun MA, Yuxuan Wang, and Chao Zhang.video-SALMONN: Speech-enhanced audio-visual large language models.In Forty-first International Conference on Machine Learning, 2024.
- Sun etal. [2023]Luoyi Sun, Xuenan Xu, Mengyue Wu, and Weidi Xie.A large-scale dataset for audio-language representation learning.arXiv preprint arXiv:2309.11500, 2023.
- Tang etal. [2023]Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, etal.Video understanding with large language models: A survey.arXiv preprint arXiv:2312.17432, 2023.
- Tang etal. [2024]Yunlong Tang, Daiki Shimada, Jing Bi, and Chenliang Xu.Avicuna: Audio-visual llm with interleaver and context-boundary alignment for temporal referential dialogue.arXiv preprint arXiv:2403.16276, 2024.
- Team [2024]Qwen Team.Qwen2.5: A party of foundation models, 2024.
- Tian etal. [2018]Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu.Audio-visual event localization in unconstrained videos.In ECCV, 2018.
- Touvron etal. [2023]Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, etal.Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023.
- Wang etal. [2024a]Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin.Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024a.
- Wang etal. [2024b]Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang.Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture.arXiv preprint arXiv:2409.02889, 2024b.
- Wang etal. [2023]Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, etal.Internvid: A large-scale video-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2307.06942, 2023.
- Wu etal. [2024a]Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, AlexanderH Liu, and Hung-yi Lee.Towards audio language modeling-an overview.arXiv preprint arXiv:2402.13236, 2024a.
- Wu etal. [2024b]Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li.Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024b.
- Wu etal. [2023]Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and SYu Philip.Multimodal large language models: A survey.In 2023 IEEE International Conference on Big Data (BigData), pages 2247–2256. IEEE, 2023.
- Wu etal. [2024c]Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua.NExT-GPT: Any-to-any multimodal LLM.In Proceedings of the International Conference on Machine Learning, pages 53366–53397, 2024c.
- Xi etal. [2023]Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, etal.The rise and potential of large language model based agents: A survey.arXiv preprint arXiv:2309.07864, 2023.
- Xiao etal. [2021]Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua.Next-qa: Next phase of question-answering to explaining temporal actions.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9777–9786, 2021.
- Yan and Lee [2024]Yibo Yan and Joey Lee.Georeasoner: Reasoning on geospatially grounded context for natural language understanding.In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4163–4167, 2024.
- Yan etal. [2024a]Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, Yi-Fan Zhang, Tianlong Xu, Zhendong Chu, etal.Errorradar: Benchmarking complex mathematical reasoning of multimodal large language models via error detection.arXiv preprint arXiv:2410.04509, 2024a.
- Yan etal. [2024b]Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, and Yuxuan Liang.Urbanclip: Learning text-enhanced urban region profiling with contrastive language-image pretraining from the web.In Proceedings of the ACM on Web Conference 2024, pages 4006–4017, 2024b.
- Yang etal. [2024a]An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan.Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024a.
- Yang etal. [2022]Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu.Avqa: A dataset for audio-visual question answering on videos.In Proceedings of the 30th ACM International Conference on Multimedia, pages 3480–3491, 2022.
- Yang etal. [2024b]Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, etal.Air-bench: Benchmarking large audio-language models via generative comprehension.arXiv preprint arXiv:2402.07729, 2024b.
- Yao etal. [2024]Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, etal.Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024.
- Ye etal. [2024]Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao.Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios, 2024.
- Yu etal. [2024]Shoubin Yu, Jaehong Yoon, and Mohit Bansal.Crema: Multimodal compositional video reasoning via efficient modular adaptation and fusion.arXiv preprint arXiv:2402.05889, 2024.
- Zhai etal. [2023]Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer.Sigmoid loss for language image pre-training.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- Zhang etal. [2023]Hang Zhang, Xin Li, and Lidong Bing.Video-llama: An instruction-tuned audio-visual language model for video understanding.arXiv preprint arXiv:2306.02858, 2023.
- Zhang etal. [2024a]Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu.Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024a.
- Zhang etal. [2024b]Yuanhan Zhang, Bo Li, haotian Liu, Yongjae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li.Llava-next: A strong zero-shot video understanding model, 2024b.
- Zheng etal. [2024]Kening Zheng, Junkai Chen, Yibo Yan, Xin Zou, and Xuming Hu.Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models.arXiv preprint arXiv:2408.09429, 2024.
- Zhu etal. [2023]Wentao Zhu, Yufang Huang, Xiufeng Xie, Wenxian Liu, Jincan Deng, Debing Zhang, Zhangyang Wang, and Ji Liu.Autoshot: A short video dataset and state-of-the-art shot boundary detection.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2023.
- Zhu etal. [2024]Ye Zhu, Yu Wu, Nicu Sebe, and Yan Yan.Vision+ x: A survey on multimodal learning in the light of data.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
Supplementary Material
A Details of Datasets
A.1 Processing
As shown in Figure 2 (a), we use the video boundary detection model AutoShot[88] to extract a sequence of keyframes from each collected long video, and then segment the long video into several short clips based on these keyframes. To avoid obtaining excessively short video segments, we ensure that each segment contains at least 50 frames. Additionally, to prevent a long video from being cut into too many short segments and for processing efficiency, we limit each video to a maximum of 20 clips. To ensure the quality of the generated annotations, we employ the highest-performing models with the largest parameter sizes from the respective series in our pipeline, namely Qwen2VL-72B[65], Qwen2.5-72B[62], and Qwen2-Audio-7B[16].
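The helper below is an illustrative implementation of these two constraints (at least 50 frames per clip, at most 20 clips per video), not our exact code.

```python
# Illustrative helper enforcing the two constraints above: clips shorter than 50 frames
# are merged with a neighbor, and each video keeps at most 20 clips. Not our exact code.
def postprocess_clips(clips: list[tuple[int, int]],
                      min_frames: int = 50, max_clips: int = 20) -> list[tuple[int, int]]:
    merged: list[list[int]] = []
    for start, end in clips:
        if merged and (merged[-1][1] - merged[-1][0]) < min_frames:
            merged[-1][1] = end          # previous clip too short: absorb this one
        elif merged and (end - start) < min_frames:
            merged[-1][1] = end          # this clip too short: extend the previous one
        else:
            merged.append([start, end])
    return [tuple(c) for c in merged[:max_clips]]

print(postprocess_clips([(0, 30), (30, 400), (400, 420), (420, 900)]))
# [(0, 420), (420, 900)]
```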
Because Qwen2-Audio[16] can only process up to 30 seconds of audio, when performing audio captioning on video clips longer than 30 seconds we further divide the clip's audio into segments of at most 30 seconds, caption each audio segment separately, and finally merge the resulting audio captions.
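A minimal sketch of this chunking step is shown below; only the span computation is shown, and the Qwen2-Audio captioning call itself is omitted.

```python
# Minimal sketch of the 30-second chunking applied before audio captioning; only the
# span computation is shown, and the Qwen2-Audio captioning call itself is omitted.
def split_audio_spans(duration_s: float, max_chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Split a clip's duration into consecutive spans of at most `max_chunk_s` seconds."""
    spans, start = [], 0.0
    while start < duration_s:
        spans.append((start, min(start + max_chunk_s, duration_s)))
        start += max_chunk_s
    return spans

print(split_audio_spans(75.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```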
A.2 Prompts
At every step of synthetic data generation, we meticulously designed prompts to ensure the effectiveness of each step. We present the prompts 1) used for Visual Captioning and Audio Captioning in Figure 7, 2) used for Audio-Visual Description Fusion in Figure 8, 3) used for segment description fusion in Figure 9, and 4) used for Audio-Visual Question-Answer Generation in Figure 10.
A.3 Statistics Information
To further analyze the descriptive capabilities of audiovisual events in our dataset, we counted the word distribution of captions and QAs in the dataset. The results are shown in Figure 4, which includes a large number of visual descriptions (scene, movement, black, etc.) and auditory descriptions (audio, sound, conversation, etc.).
Figure 5 shows the distribution of the length of all videos we collected, and Figure 6 shows the distribution of the length of videos in AVBench.
A.4 Cases
Figures 11 and 12 present illustrative examples of the outputs generated by our data generation pipeline. The questions are crafted to incorporate both audio and visual elements, ensuring that they effectively test the multi-modal capabilities of the models. Additionally, we include temporal questions to further enhance the depth of learning and evaluation, enabling a more comprehensive assessment of the models’ understanding across time-dependent contexts. These diverse and carefully designed questions not only serve as a rigorous evaluation framework but also provide valuable resources for fine-tuning models, further enhancing their multi-modal and temporal reasoning capabilities.
B Manual Quality Inspection for AVBench
We synthesized AVBench in the same way as SAVEn-Vid, but to ensure the quality of AVBench, we first ensured that each sample contained at least three or more events to guarantee the richness of the video content. Then, we manually conducted further checks on the synthesized AVBench, primarily focusing on 1) grammatical and formatting issues in the title descriptions, 2) issues of information redundancy and information leakage in the questions, ensuring that answers cannot be derived solely from the text information, and 3) questions that focus only on visual or auditory information. Through this process, we ultimately eliminated low-quality samples.
C Training Details
We train SAVEnVideo for 1 epoch during the multimodal alignment stage and up to 2 epochs during the fine-tuning stage. Empirically, the global batch size and learning rate are set to 512 and 2e-3, respectively. For the Audio-Visual Temporal Fine-Tuning stage, we increase the global batch size to 1,024 and reduce the learning rate to 2e-4.
All experiments are conducted on NVIDIA A-800 GPUs, leveraging mixed-precision training to optimize efficiency. SAVEnVideo processes video inputs by uniformly sampling 16 frames per video.