
dc.contributor.advisor	Alam, Md. Ashraful
dc.contributor.advisor	Alam, Md. Golam Rabiul
dc.contributor.author	Das, Saurav
dc.contributor.author	Biswas, Shammo
dc.contributor.author	Fahim, Taimoor
dc.contributor.author	Sanjan, M.A.B. Siddique
dc.contributor.author	Tarannum, Tasnia Alam
dc.date.accessioned	2024-10-17T05:33:21Z
dc.date.available	2024-10-17T05:33:21Z
dc.date.copyright	©2024
dc.date.issued	2024-05
dc.identifier.other	ID 20101100
dc.identifier.other	ID 20101359
dc.identifier.other	ID 23241093
dc.identifier.other	ID 19201068
dc.identifier.other	ID 20301179
dc.identifier.uri	http://hdl.handle.net/10361/24342
dc.description	This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.	en_US
dc.description	Cataloged from PDF version of thesis.
dc.description	Includes bibliographical references (pages 52-55).
dc.description.abstract	Video understanding and description play an important role in computer vision and natural language processing. The ability to automatically generate natural language descriptions for video content has many real-world applications, ranging from accessibility tools to multimedia retrieval systems. Understanding and describing video content in natural language is a challenging task in any language, and more so in a resource-constrained language like Bangla. This study investigates the integration of a feature fusion method with an attention-based encoder-decoder framework to improve video comprehension and to generate accurate Bangla captions for single-action video clips. We propose a novel multimodal fusion model that combines visual features extracted from video frames with motion information derived from optical flow. The fused multimodal representations are then fed into an attention-based encoder-decoder architecture to generate descriptive captions in Bangla. To facilitate our research, we collected and annotated a new dataset of single-action videos sourced from various online platforms. Extensive experiments are conducted on this newly created Bangla single-action video dataset, with the models evaluated using standard metrics such as BLEU, METEOR, and CIDEr. Among the models tested, including architectural variations, the GRU-Gaussian Attention model achieves the best performance, generating captions closest to the ground truth. As this is a new dataset with no previous benchmarks, the proposed approach establishes a strong baseline for Bangla video captioning, achieving a BLEU score of 0.53 and a CIDEr score of 0.492. Additionally, we analyze the attention mechanisms to interpret the learned representations, providing insights into the model’s behavior and decision-making process. This work on developing solutions for under-resourced languages paves the way for enhanced video comprehension, with potential applications in human-computer interaction, accessibility, and multimedia retrieval. (An illustrative code sketch of this pipeline follows the record below.)	en_US
dc.description.statementofresponsibility	Saurav Das
dc.description.statementofresponsibility	Shammo Biswas
dc.description.statementofresponsibility	Taimoor Fahim
dc.description.statementofresponsibility	M.A.B. Siddique Sanjan
dc.description.statementofresponsibility	Tasnia Alam Tarannum
dc.format.extent	65 pages
dc.language.iso	en	en_US
dc.publisher	Brac University	en_US
dc.rights	Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission.
dc.subject	Video captioning	en_US
dc.subject	Bangla language	en_US
dc.subject	Video processing	en_US
dc.subject	Natural language processing	en_US
dc.subject	Feature fusion	en_US
dc.subject	Encoder-decoder framework	en_US
dc.subject	Multimodal fusion	en_US
dc.subject	GRU-Gaussian attention model	en_US
dc.subject	CIDEr score	en_US
dc.subject.lcsh	Natural language processing (Computer science)
dc.subject.lcsh	Neural networks (Computer science)
dc.title	Enhancing Bangla video comprehension through multimodal feature integration and attention-based encoder-decoder captioning models for single-action videos	en_US
dc.type	Thesis	en_US
dc.contributor.department	Department of Computer Science and Engineering, Brac University
dc.description.degree	B.Sc. in Computer Science
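
The abstract describes an early-fusion, attention-based encoder-decoder pipeline with a GRU and Gaussian attention. Below is a minimal, hypothetical PyTorch sketch of that idea, not the thesis's actual implementation: the feature dimensions, vocabulary size, module names, and the exact form of the Gaussian attention (here, a Gaussian weighting over encoder time steps centred at a predicted position, one common reading of the term) are all assumptions, since this record does not specify them.

```python
# Hypothetical sketch only; shapes, names, and the Gaussian-attention form
# are assumptions, as the record gives no implementation details.
import torch
import torch.nn as nn

class GaussianAttention(nn.Module):
    """Soft attention whose weights form a Gaussian over encoder time steps,
    centred at a position predicted from the decoder state."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.center = nn.Linear(hidden_dim, 1)          # predicts attention centre
        self.log_sigma = nn.Parameter(torch.zeros(1))   # learnable width

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H); enc_outputs: (B, T, H)
        T = enc_outputs.size(1)
        pos = torch.sigmoid(self.center(dec_state)) * (T - 1)        # (B, 1)
        steps = torch.arange(T, device=enc_outputs.device).float()   # (T,)
        sigma = self.log_sigma.exp()
        weights = torch.exp(-0.5 * ((steps - pos) / sigma) ** 2)     # (B, T)
        weights = weights / weights.sum(dim=1, keepdim=True)
        context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=1)   # (B, H)
        return context, weights

class FusionCaptioner(nn.Module):
    """Early fusion of frame and optical-flow features, GRU encoder,
    Gaussian-attention GRU decoder (teacher forcing)."""
    def __init__(self, frame_dim=2048, flow_dim=1024, hidden=512, vocab=8000):
        super().__init__()
        self.fuse = nn.Linear(frame_dim + flow_dim, hidden)   # feature fusion
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = GaussianAttention(hidden)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRUCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, flows, captions):
        # frames: (B, T, frame_dim); flows: (B, T, flow_dim); captions: (B, L)
        fused = torch.tanh(self.fuse(torch.cat([frames, flows], dim=-1)))
        enc_out, h = self.encoder(fused)   # enc_out: (B, T, H)
        state = h.squeeze(0)               # (B, H)
        logits = []
        for t in range(captions.size(1)):
            context, _ = self.attn(state, enc_out)
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            state = self.decoder(step_in, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)  # (B, L, vocab)

# Example with random tensors standing in for real CNN/optical-flow features:
# model = FusionCaptioner()
# frames, flows = torch.randn(2, 16, 2048), torch.randn(2, 16, 1024)
# caps = torch.randint(0, 8000, (2, 12))
# logits = model(frames, flows, caps)  # (2, 12, 8000); train with cross-entropy
```

Fusing the two streams before the encoder is only one option; the "feature fusion method" the abstract mentions could equally be a late fusion of separately encoded frame and flow streams.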

