Text classification with an efficient preprocessing technique for cross-language and multilingual data
Abstract
Text preprocessing refers to the procedure of removing extraneous textual elements and preparing or transforming the values to be fed into a classifier model. There are several preprocessing methods; however, not all of them are effective when applied to cross-language and multilingual datasets. Running a cross-lingual or multilingual dataset through a single preprocessing method and text classification model is rather challenging. What if a technique could be used to better classify data from multilingual and cross-lingual datasets? To improve classification accuracy, we tested various combinations of data preprocessing methods and text classification models on datasets in Bangla, English, and cross-lingual text (the native language written in English letters). From our experiments, we infer that mLSTM performed effectively on both the Bangla and English datasets. Thus, mLSTM can be a helpful classification model for datasets containing a variety of languages.
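To make the preprocessing step concrete, the following is a minimal sketch (not the paper's actual pipeline) of language-aware text cleaning that handles both native Bangla script and transliterated/English text in one pass. It keeps characters from the Unicode Bengali block (U+0980–U+09FF) alongside basic Latin letters and digits, and strips punctuation such as the Bangla danda; the function name and regex are illustrative assumptions.

```python
import re

def preprocess(text: str) -> list[str]:
    """Clean mixed Bangla/English text and split it into tokens."""
    # Lowercase Latin characters; Bangla script has no case distinction,
    # so this only affects the English / transliterated portions.
    text = text.lower()
    # Keep Bengali-block characters (U+0980-U+09FF) plus basic Latin
    # letters and digits; replace everything else (punctuation, symbols,
    # the Bangla danda U+0964) with spaces.
    text = re.sub(r"[^\u0980-\u09ffa-z0-9]+", " ", text)
    # Collapse runs of whitespace and return the token list.
    return text.split()

# Works on transliterated text, native script, or a mix of both:
print(preprocess("Ami bhalo achi!!  আমি ভালো আছি।"))
# → ['ami', 'bhalo', 'achi', 'আমি', 'ভালো', 'আছি']
```

A real pipeline would typically add language-specific steps (e.g., stopword removal or stemming per language) before feeding the tokens to a classifier such as mLSTM.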