Towards Santali linguistic inclusion: building the first Santali-to-English translation model using mT5 transformer and data augmentation
Date
2023-09
Publisher
Brac University
Author
Billah, Syed Mohammed Mostaque
Subarna, Ateya Ahmed
Sarna, Sudipta Nandi
Wasit, Ahmad Shawkat
Shawkat, Ahmad
Abstract
Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak
Santali, making it roughly the third most widely spoken Austroasiatic language.
Despite its prominence among the Austroasiatic language family’s Munda
subfamily, Santali lacks global recognition. Currently, no translation models exist
for the Santali language. This paper aims to bring Santali into the NLP spectrum.
We aim to examine the feasibility of building Santali-English translation
models based on available Santali corpora. The paper addresses the
low-resource problem and, with promising results, examines the feasibility of
Santali-English translation. We expect this study to open the door to further exploration
into Santali-English machine translation.