Tri-modal ensemble for enhanced Bangla sign language recognition
Abstract
Sign language is the most common method of communication for people with disabling
hearing loss. In Bangladesh, where Bangla Sign Language (BdSL) is prominently used
by the deaf and hard-of-hearing community, communicating with the general population
remains challenging. Thus, a system that understands BdSL accurately and efficiently is in high demand.
Deep learning architectures such as CNNs, ANNs, RNNs, and Axis Independent LSTMs
can translate Bangla Sign Language into readable digital text. Commonly, an
image-based sign language recognition system uses a camera that continuously
sends images to a model, which then predicts the sign from those images.
However, this setup introduces many sources of uncertainty, such as lighting
conditions, noisy backgrounds, skin color, and hand orientation. To this end, we
propose a procedure that reduces this uncertainty by considering three
different modalities: spatial information, skeleton awareness, and edge awareness.
We propose three image pre-processing techniques and integrate three convolutional
neural network models. Finally, we evaluated nine different ensemble meta-learning
algorithms, five of which are modifications of averaging and voting techniques.
As a result, our proposed models achieved training accuracies of 99.77%, 98.11%,
and 99.30%, higher than any other state-of-the-art image classification architecture
except ResNet50 at 99.87%. On the testing set, we achieved a highest accuracy of
95.13%. This research shows that considering multiple modalities can improve the
system's overall performance.
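As an illustration of the ensembling step (a minimal sketch, not taken from the paper itself), the Python snippet below combines the per-class softmax outputs of three modality-specific models (spatial, skeleton, edge) using simple averaging and majority voting. All function names, the 38-class count, and the use of NumPy are assumptions made for demonstration only.

import numpy as np

def soft_average_ensemble(prob_list, weights=None):
    # prob_list: list of (num_samples, num_classes) softmax outputs, one per base model.
    probs = np.stack(prob_list, axis=0)                 # (num_models, num_samples, num_classes)
    if weights is None:
        weights = np.full(probs.shape[0], 1.0 / probs.shape[0])
    avg = np.tensordot(weights, probs, axes=1)          # weighted mean over the model axis
    return avg.argmax(axis=1)                           # predicted class per sample

def majority_vote_ensemble(prob_list):
    # Hard voting: each model casts one vote (its argmax class) per sample.
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=0)   # (num_models, num_samples)
    num_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=num_classes), 0, votes)    # (num_classes, num_samples)
    return counts.argmax(axis=0)

# Hypothetical usage with random softmax outputs; 38 classes is an assumed BdSL class count.
rng = np.random.default_rng(0)
spatial_p, skeleton_p, edge_p = (rng.dirichlet(np.ones(38), size=4) for _ in range(3))
print(soft_average_ensemble([spatial_p, skeleton_p, edge_p]))
print(majority_vote_ensemble([spatial_p, skeleton_p, edge_p]))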