ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Vipula Rawte ¹^* Sarthak Jain ²^† Aarush Sinha ³^† Garv Kaushik ⁴^† Aman Bansal ⁵^† Prathiksha Rumale Vishwanath ⁵^† Samyak Rajesh Jain ⁶ Aishwarya Naresh Reganti ⁷^§ Vinija Jain ⁸^§ Aman Chadha ⁹^§ Amit P. Sheth ¹ Amitava Das ¹

AI Institute, University of South Carolina, USA¹
Guru Gobind Singh Indraprastha University, India²
Vellore Institute of Technology, India³
Indian Institute of Technology (BHU), India⁴
University of Massachusetts Amherst, USA⁵
University of California Santa Cruz, USA⁶
Amazon Web Services⁷
Meta⁸ Amazon GenAI⁹

^* Corresponding Author
^† Equal Contribution
^§ Worked independent of the position

arXiv Code

Dataset

Abstract

Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated.

We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories.

ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.

Overall pipeline of building and evaluating the dataset.

Hallucination Categories

Omission Error: The generated video omits essential components of the initial prompt except in cases involving specified subject counts—resulting in an incomplete or inaccurate portrayal, or introduces unscripted actions or behaviors, it leads to a misrepresentation of the intended scene.
Numeric Variability: In a given prompt, if the subject count is specified, the generated video either increases or decreases the number of instances of the subject
Temporal Dysmorphia: Objects rendered within the video exhibit continuous temporal deformation, undergoing gradual or intermittent transformations in shape, scale, or orientation over the duration of the sequence
Physical Incongruity: The generated video violates fundamental physical laws or juxtaposes incongruent elements, leading to perceptual inconsistencies or cognitive dissonance for the viewer.
Vanishing Subject: The subject, or a portion thereof, in the generated video intermittently disappears at arbitrary points within the video's duration

Omission Error

Wildebeest and zebras are lying in the grass

Wooden cabinetry in a blue kitchen with white appliances

A airport runway with a large plane and cars parked on one side

White teddy bears cavort about an ice cream filled refrigerator in a department store window

Three men wearing specialty clothing with machines strapped to backs

Wildebeest and zebras are lying in the grass

Wooden cabinetry in a blue kitchen with white appliances

A airport runway with a large plane and cars parked on one side

White teddy bears cavort about an ice cream filled refrigerator in a department store window

Three men wearing specialty clothing with machines strapped to backs

Numeric Variability

White clock tower with three flags on top

2 bowls of fruit sit on a table

7 jars full of grain with rotten bananas hanging over them

Six people with their hands on an outdoor kiln

Two people on their surfboards in rough water

White clock tower with three flags on top

2 bowls of fruit sit on a table

7 jars full of grain with rotten bananas hanging over them

Six people with their hands on an outdoor kiln

Two people on their surfboards in rough water

Temporal Dysmorphia

A man in athletic wear swings a tennis racket through the air

A baseball game going on while a player holds a bat ready

older infant sitting outside holding a baby toothbrush in his mouth

Seated teenage girl eating her breakfast at home

Skateboarder and blue shirt and black jeans jumping on his board

A man in athletic wear swings a tennis racket through the air

A baseball game going on while a player holds a bat ready

older infant sitting outside holding a baby toothbrush in his mouth

Seated teenage girl eating her breakfast at home

Skateboarder and blue shirt and black jeans jumping on his board

Physical Incongruity

a animal that is walking in a crowd of people

2 men on a court play a game of tennis

Traffic lights on a street glowing in a night sky

there is a red stop sign in front of houses

Series of lights coming off of a passenger train at night

a animal that is walking in a crowd of people

2 men on a court play a game of tennis

Traffic lights on a street glowing in a night sky

there is a red stop sign in front of houses

Series of lights coming off of a passenger train at night

Vanishing Subject

Two young boys playing Wii bowling on a large television screen

A man scooping food into a pan

Slices of orange are arranged on a plate

ridgewood and charles ave stop sign four way street

A couple of guys are playing a soccer game

Two young boys playing Wii bowling on a large television screen

A man scooping food into a pan

Slices of orange are arranged on a plate

ridgewood and charles ave stop sign four way street

A couple of guys are playing a soccer game

Dataset Details

To construct the ViBe dataset, we selected 700 random captions from the MS COCO dataset, which is known for its diverse and descriptive textual prompts, making it an ideal resource for evaluating the generative performance of T2V models. These captions were then used as input for ten distinct open-source T2V models, chosen to represent a variety of architectures, model sizes, and training paradigms. The specific models included in the study were:

Name	Source	Link
AnimateLCM	HuggingFace	View
AnimateLightning	HuggingFace	View
AnimateMotionAdapter	HuggingFace	View
HotShotXL	HuggingFace	View
MagicTime	GitHub	View
Show1	GitHub	View
MSB1.7b	HuggingFace	View
zeroscope_576w	HuggingFace	View
zeroscope_XL	HuggingFace	View
MORA	GitHub	View

A detailed comparison of model accuracy and F1 score is presented for various combinations of models utilizing VideoMAE and TimeSFormer embeddings. The model yielding the highest performance is denoted in green for easy identification.

BibTeX

@article{rawte2024vibe,
  title={ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models},
  author={Rawte, Vipula and Jain, Sarthak and Sinha, Aarush and Kaushik, Garv and Bansal, Aman and Vishwanath, Prathiksha Rumale and Jain, Samyak Rajesh and Reganti, Aishwarya Naresh and Jain, Vinija and Chadha, Aman and others},
  journal={arXiv preprint arXiv:2411.10867},
  year={2024}}