Document template identification and data extraction using machine learning and deep learning approach
Abstract
As technology advances, the demand for fast data processing and organization
keeps growing. People have more access to technology than ever before, and
nearly every kind of data can now be processed and stored digitally. However,
paper-based procedures are still in place, and moving their data from paper to
computers is laborious and time-consuming, which reduces work efficiency.
Our goal is to make this tedious and time-consuming
process fast and efficient by directly converting the information on manually
checked scripts into digital data. Our approach gathers Brac University
examination scripts, digitizes the data on the checked scripts, and uploads it
to a spreadsheet file. The goal is to make Brac University’s grade-processing
system quicker, more effective, and less tiresome for the teachers.
Three machine learning models and three deep learning models as well
as one transfer learning model were used in this study. The results were
evaluated with three common measures: precision, recall, and F1-score. The
KNN model reached up to 85% accuracy, while SVM reached 87% and SGDClassifier
reached 81%. Meanwhile, CNN and YOLOv8 reached 98.6% and 98.8% accuracy,
respectively. Since YOLOv8 provides the best accuracy, we will use
it to build an interface that carries out the complete data transformation
process from beginning to end. YOLOv8 will be used throughout this process:
capturing the image, processing it to identify the regions from which the data
will be collected, and finally extracting the data. The end result is precisely
extracted data from handwritten exam scripts, arranged in a spreadsheet, which
replaces the laborious task of manually entering every grade.
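
The evaluation measures named in the abstract (precision, recall, and F1-score,
alongside accuracy) can be illustrated with a minimal sketch for a multi-class
classifier. The function names and the toy labels below are assumptions made
purely for illustration. They are not taken from the study's code or data.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy example: true vs. predicted handwritten digits.
y_true = [3, 7, 7, 1, 3, 7]
y_pred = [3, 7, 1, 1, 3, 7]

if __name__ == "__main__":
    for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
        print(f"digit {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
    print(f"accuracy={accuracy(y_true, y_pred):.2f}")
```

Accuracy summarizes all classes in one number, while the per-class measures
show which digits are confused with one another, which is why the two are
usually reported together.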