📚 Vid2Seq: a pretrained visual language model for describing multi-event videos


💡 News category: AI News
🔗 Source: ai.googleblog.com

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task, as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single-image captioning and standard video captioning, which describe a single image or a short video with one sentence.

Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations: for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.

In this post, we introduce "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning", to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. To pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have released the code for Vid2Seq here.
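To make the pseudo-labeling idea concrete, here is a minimal sketch of how transcribed speech could be turned into pseudo dense-captioning annotations. The data structure and function name are illustrative assumptions, not taken from the released code.

```python
# Hypothetical sketch: each transcribed speech sentence, together with its ASR
# timestamps, becomes one pseudo event (boundaries + caption) for pre-training.

def speech_to_pseudo_events(asr_sentences):
    """asr_sentences: list of (start_sec, end_sec, text) tuples from ASR."""
    pseudo_events = []
    for start, end, text in asr_sentences:
        pseudo_events.append({
            "start": start,    # sentence start time -> pseudo event boundary
            "end": end,        # sentence end time   -> pseudo event boundary
            "caption": text,   # transcribed sentence -> pseudo event caption
        })
    return pseudo_events

# Example
asr = [(0.0, 12.5, "first we hitch the dogs to the sled"),
       (12.5, 60.0, "then the team races down the trail")]
print(speech_to_pseudo_events(asr))
```

This supervision is only weakly aligned with the visual content, which is exactly why the pre-training objectives described below also condition the model on the video frames.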

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

To achieve this, we augment a visual language model with special time tokens (akin to text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
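To make the sequence format concrete, the sketch below shows one plausible way to discretize timestamps into special time tokens and serialize events into a single target sequence. The number of time bins and the exact token layout are assumptions for illustration; see the paper and released code for the actual choices.

```python
# Hypothetical sketch: quantize timestamps (relative to video duration) into a
# fixed vocabulary of time tokens, then interleave them with caption text.

NUM_TIME_BINS = 100  # assumed size of the time-token vocabulary

def time_token(t_sec, video_duration_sec, num_bins=NUM_TIME_BINS):
    """Map an absolute timestamp to a discrete time-token string."""
    frac = min(max(t_sec / video_duration_sec, 0.0), 1.0)
    return f"<time_{min(int(frac * num_bins), num_bins - 1)}>"

def events_to_sequence(events, video_duration_sec):
    """Serialize (start, end, caption) events into one token sequence."""
    parts = []
    for ev in sorted(events, key=lambda e: e["start"]):
        parts += [time_token(ev["start"], video_duration_sec),
                  time_token(ev["end"], video_duration_sec),
                  ev["caption"]]
    return " ".join(parts)

events = [{"start": 0.0, "end": 12.5, "caption": "a musher hitches dogs to a sled"},
          {"start": 12.5, "end": 60.0, "caption": "the dogs pull the sled down the trail"}]
print(events_to_sequence(events, video_duration_sec=60.0))
# <time_0> <time_20> a musher hitches dogs to a sled <time_20> <time_99> the dogs pull the sled down the trail
```

The same serialization applies to the transcribed speech input, so both the input and the output of the model live in a single vocabulary of text and time tokens.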

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
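The encode/decode flow can be summarized with the structural sketch below; the component names and the greedy decoding loop are illustrative assumptions, not the released implementation.

```python
# Hypothetical structural sketch of the Vid2Seq inference flow: encode frames and
# transcribed speech, then autoregressively decode time and text tokens.

def vid2seq_generate(video_frames, speech_tokens,
                     visual_encoder, text_encoder, text_decoder,
                     max_len=512):
    visual_feats = visual_encoder(video_frames)   # list of frame embeddings
    speech_feats = text_encoder(speech_tokens)    # list of speech-token embeddings
    context = list(visual_feats) + list(speech_feats)  # concatenated encoder outputs

    output = ["<bos>"]
    while output[-1] != "<eos>" and len(output) < max_len:
        # Each step attends to the multimodal context and to the tokens emitted so far.
        output.append(text_decoder(context, output))
    return output[1:]  # dense event captions interleaved with their time tokens
```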

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
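The denoising objective relies on span masking of the transcribed speech sequence. Below is a small, hypothetical sketch of such corruption; the number of spans, span length, and mask token are illustrative assumptions, not the exact pre-training configuration.

```python
import random

# Hypothetical sketch: corrupt the speech token sequence by masking random
# contiguous spans. The generative objective uses the full speech sequence as
# target with visual input only; the denoising objective asks the model to
# recover the masked spans given the corrupted sequence plus the visual input.

def mask_spans(tokens, num_spans=2, span_len=3, seed=0):
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for _ in range(num_spans):
        if len(corrupted) <= span_len:
            break
        start = rng.randrange(0, len(corrupted) - span_len)
        span = corrupted[start:start + span_len]
        if "<mask>" in span:
            continue  # skip spans that overlap an already-masked region
        targets.append((start, span))
        corrupted[start:start + span_len] = ["<mask>"] * span_len
    return corrupted, targets

tokens = "<time_0> <time_20> first we hitch the dogs to the sled".split()
print(mask_spans(tokens))
```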

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.
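As a concrete illustration of the teacher-forcing loss, the sketch below sums the negative log-likelihood of each ground-truth token given the ground-truth prefix; `token_logprob` is a hypothetical stand-in for the fine-tuned model, not an actual API.

```python
# Hypothetical sketch of teacher forcing: condition on the ground-truth prefix at
# every step and accumulate the negative log-likelihood of the next ground-truth token.

def teacher_forcing_nll(token_logprob, context, target_tokens):
    nll = 0.0
    prefix = ["<bos>"]
    for tok in list(target_tokens) + ["<eos>"]:
        nll -= token_logprob(context, prefix, tok)  # log p(tok | context, prefix)
        prefix.append(tok)  # feed the ground-truth token, never the model's own prediction
    return nll
```

At inference time the model instead feeds back its own predictions to generate the full sequence of time tokens and captions.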

Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).

Conclusion

We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pretrained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.


Acknowledgements

This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.

...


