Utility of large language models to extract commonsense knowledge
Abstract
Large language models are artificial intelligence models capable of understanding and generating natural language text, as they are trained on vast amounts of data spanning many languages. Their training sources include books, articles, websites, and more. Since large language models learn the syntax and structure of these languages thoroughly, we can expect them to perform well for the Bengali language and to encode substantial knowledge of Bengali culture. One of the challenges of working with the Bengali language is the scarcity of Natural Language Processing resources for tasks such as semantic parsing, part-of-speech tagging, and named entity recognition. Our aim was to test the effectiveness of large language models in answering queries about Bengali culture and language, and to identify which areas of knowledge require improvement.
Because large language models do not require such Natural Language Processing tools, they could serve our purpose. We therefore constructed a corpus to analyze the utility of large language models for the Bengali language. This corpus helped us identify the gaps in large language models' factual and cultural commonsense knowledge through natural language processing tasks such as question answering and masked prediction.