dc.contributor.advisor | Alam, Md. Ashraful | |
dc.contributor.advisor | Alam, Md. Golam Rabiul | |
dc.contributor.author | Das, Saurav | |
dc.contributor.author | Biswas, Shammo | |
dc.contributor.author | Fahim, Taimoor | |
dc.contributor.author | Sanjan, M.A.B. Siddique | |
dc.contributor.author | Tarannum, Tasnia Alam | |
dc.date.accessioned | 2024-10-17T05:33:21Z | |
dc.date.available | 2024-10-17T05:33:21Z | |
dc.date.copyright | ©2024 | |
dc.date.issued | 2024-05 | |
dc.identifier.other | ID 20101100 | |
dc.identifier.other | ID 20101359 | |
dc.identifier.other | ID 23241093 | |
dc.identifier.other | ID 19201068 | |
dc.identifier.other | ID 20301179 | |
dc.identifier.uri | http://hdl.handle.net/10361/24342 | |
dc.description | This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024. | en_US |
dc.description | Cataloged from PDF version of thesis. | |
dc.description | Includes bibliographical references (pages 52-55). | |
dc.description.abstract | Video understanding and description play an important role in the fields of
computer vision and natural language processing. The ability to automatically
generate natural language descriptions for video content has many real-world applications,
ranging from accessibility tools to multimedia retrieval systems.
Understanding and describing video content in natural language is a challenging
task, and it is even more so in resource-constrained languages like Bangla. This
study investigates the integration of a feature fusion method and an attention-based
encoder-decoder framework to improve video comprehension and to generate accurate
captions for single-action video clips in Bangla. We propose a novel model
based on multimodal fusion, combining visual features extracted from video frames with
motion information derived from optical flow. The fused multimodal representations
are then fed into an attention-based encoder-decoder architecture to
generate descriptive captions in Bangla. To facilitate our research, we
collected and annotated a new dataset comprising single-action videos sourced from
various online platforms. Extensive experiments are conducted on this newly created
Bangla single-action video dataset, with the models evaluated using standard
metrics such as BLEU, METEOR, and CIDEr. Among the models tested, including
architectural variations, the GRU-Gaussian Attention model achieves the best performance,
generating captions closest to the ground truth. As this is a new dataset
with no previous benchmarks, the proposed approach establishes a strong baseline
for Bangla video captioning, achieving a BLEU score of 0.53 and a CIDEr score of
0.492. Additionally, we analyze the attention mechanisms to interpret the learned
representations, providing insights into the model's behavior and decision-making
process. This work on developing solutions for under-resourced languages paves
the way for enhanced video comprehension, with potential applications in human-computer
interaction, accessibility, and multimedia retrieval. | en_US |
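The pipeline described in the abstract (fusing frame features with optical-flow features, then decoding captions with a GRU and a Gaussian attention) can be sketched roughly as follows. This is a minimal illustrative sketch only: the feature dimensions, fusion by concatenation, and the particular Gaussian attention parameterisation are assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (assumed details, not the thesis code): multimodal fusion of
# RGB-frame and optical-flow features, a GRU encoder, and a GRU decoder with a
# Gaussian attention over encoder timesteps.
import torch
import torch.nn as nn


class GaussianAttention(nn.Module):
    """Attention weights follow a Gaussian centred at a position predicted
    from the decoder hidden state (one plausible 'Gaussian attention')."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.center = nn.Linear(hidden_dim, 1)        # predicts the Gaussian centre
        self.log_sigma = nn.Parameter(torch.zeros(1))  # learnable spread

    def forward(self, dec_hidden, enc_outputs):
        # dec_hidden: (B, H); enc_outputs: (B, T, H)
        B, T, _ = enc_outputs.shape
        mu = torch.sigmoid(self.center(dec_hidden)) * (T - 1)        # (B, 1)
        pos = torch.arange(T, device=enc_outputs.device).float()      # (T,)
        sigma = torch.exp(self.log_sigma)
        scores = -((pos - mu) ** 2) / (2 * sigma ** 2)                 # (B, T)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)  # (B, H)
        return context, weights


class BanglaVideoCaptioner(nn.Module):
    def __init__(self, rgb_dim=2048, flow_dim=1024, hidden_dim=512, vocab_size=8000):
        super().__init__()
        # Early fusion: concatenate the two modalities, then project.
        self.fuse = nn.Linear(rgb_dim + flow_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.attention = GaussianAttention(hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim * 2, hidden_dim)  # input: [word emb; context]
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, rgb_feats, flow_feats, captions):
        # rgb_feats: (B, T, rgb_dim); flow_feats: (B, T, flow_dim); captions: (B, L) token ids
        fused = torch.tanh(self.fuse(torch.cat([rgb_feats, flow_feats], dim=-1)))
        enc_outputs, h = self.encoder(fused)          # (B, T, H), (1, B, H)
        h = h.squeeze(0)
        logits = []
        for t in range(captions.size(1)):             # teacher forcing over caption tokens
            context, _ = self.attention(h, enc_outputs)
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h = self.decoder(step_in, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)              # (B, L, vocab_size)
```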
dc.description.statementofresponsibility | Saurav Das | |
dc.description.statementofresponsibility | Shammo Biswas | |
dc.description.statementofresponsibility | Taimoor Fahim | |
dc.description.statementofresponsibility | M.A.B. Siddique Sanjan | |
dc.description.statementofresponsibility | Tasnia Alam Tarannum | |
dc.format.extent | 65 pages | |
dc.language.iso | en | en_US |
dc.publisher | Brac University | en_US |
dc.rights | Brac University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. | |
dc.subject | Video captioning | en_US |
dc.subject | Bangla language | en_US |
dc.subject | Video processing | en_US |
dc.subject | Natural language processing | en_US |
dc.subject | Feature fusion | en_US |
dc.subject | Encoder-decoder framework | en_US |
dc.subject | Multimodal fusion | en_US |
dc.subject | GRU-Gaussian attention model | en_US |
dc.subject | CIDEr score | en_US |
dc.subject.lcsh | Natural language processing (Computer science). | |
dc.subject.lcsh | Neural networks (Computer science). | |
dc.title | Enhancing Bangla video comprehension through multimodal feature integration and attention-based encoder-decoder captioning models for single-action videos | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Department of Computer Science and Engineering, Brac University | |
dc.description.degree | B.Sc. in Computer Science | |