Comparison of Unigram, Bigram, HMM and Brill's POS tagging approaches for some South Asian languages
Abstract
Part-of-Speech (POS) tagging is the process of assigning each word in a sentence a suitable tag from a given set of tags. POS tagging is important in various areas of Natural Language Processing. Different methods of automating the process have been developed and employed for English and other Western languages. Some similar work, most of which uses stochastic approaches to POS tagging, has also been done for South Asian languages. We experimented with some of the widely used approaches to POS tagging on three South Asian languages, Bangla, Hindi and Telugu, using corpora of different sizes. We observed the performance of these approaches and found Brill’s transformation-based tagger to be superior to the other approaches in all of our experiments, though the use of this approach has been very limited until recently.