Identifying hate speech of Bangla language text using natural language processing

Rahman, Mushfiqur; Jui, Razia Sultana; Sakib, Chowdhury Nazmuz; Ridoy, Fahim Alavi; Ananya, Taskiea Tabassum

Date

2024-01

Publisher

Brac University

Abstract

In the internet era, sharing information through social media has brought significant benefits. People can easily observe others’ lifestyles and work, and comment on or share thoughts about them. However, this practice also brings challenges, such as the spread of hate comments, abusive online criticism, and general toxicity. The internet’s flexibility and anonymity have created a culture in which users find it easy to express themselves aggressively. As the amount of hate speech increases, there is a need for a method to detect it automatically. To address this concern, recent research has applied diverse feature engineering methods and machine learning algorithms to automatically identify hate speech across various datasets. Since this is a Natural Language Processing (NLP) problem, our goal is to use NLP to detect hate speech and to demonstrate how deep learning and machine learning (ML) can be applied to it. Of the more than 7,100 languages spoken throughout the world, we have chosen Bengali as the language of our dataset. We train machine learning and deep learning models to detect hate speech automatically. We use Multinomial Naive Bayes, RNN, Random Forest, Logistic Regression, Decision Tree Classifier, a CNN-LSTM hybrid, and multilingual Bidirectional Encoder Representations from Transformers (mBERT), and compare their results. Among these algorithms, mBERT achieved the highest accuracy for binary classification, at 90.00%. For multiclass classification, the CNN-LSTM hybrid achieved the highest accuracy, at 64%, followed by mBERT at 62%. We are committed to further improving these results.

Keywords

Bangla language; Natural language processing; Machine learning; Deep learning; Offensive language

LC Subject Headings

Natural language processing (Computer science).; Automatic speech recognition.; Deep learning (Machine learning).

Description

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2024.

Cataloged from PDF version of thesis.

Includes bibliographical references (pages 22-24).

Department

Department of Computer Science and Engineering, Brac University

Type

Thesis

Collections

Thesis & Report, BSc (Computer Science and Engineering)