Analyzing Schizophrenic-prone text from social media content: a novel approach through ML and NLP
Abstract
Schizophrenia is one of the destructive personality disorders where people have unusual
interpretations of reality and are lured to develop harmful actions if not diagnosed
promptly. This study focuses on identifying language patterns indicative of
schizophrenic-prone texts in online communication and intends to contribute to the
development of early intervention techniques in mental health utilizing ML and NLP
methods. This study used two datasets to examine language patterns associated
with schizophrenia in social media posts. The first dataset, Pre existing obtained
from a repository focused on identifying schizophrenia-related postings, functions
as a standard for comparison and evaluation. The second dataset, New scrapped
obtained by extracting information from subreddits associated with schizophrenia,
offers a more extensive range of language patterns. The dual-phase technique entails
training models using the existing dataset and evaluating their performance on the
newly collected dataset. The research uses various models, including transformer
model BERT, recurrent neural network model Bi-LSTM, and GRU, as well as machine
learning models such as Support Vector Classifier, Logistic Regression, Multinomial
Naive Bayes, Random Forest, and Decision Tree to predict whether textual
data is suggestive of schizophrenia. The language patterns of schizophrenic-prone
texts differ from texts written by mentally-healthy individuals, encompassing phonological,
morphological, and syntactic aspects. These models can analyze linguistic
patterns and acquire knowledge about them. The results achieved after the training
of the models are outstanding. The DistilBERT transformer model achieves 97%
and 84% accuracy, GRU achieves high accuracy rates of 91% and 79%, the logistic
regression machine learning model demonstrates impressive efficiency with accuracy
rates of 93% and 83% respectively for Pre existing and New scrapped dataset. In
order to ensure the models can effectively handle new data, we conducted a contemporary
comparison. This analysis revealed that consistent data collection is
necessary for accurate predictive results.