ProteoKnight: phage virion protein classification with CNN and uncertainty quantification
Abstract
microbial ecosystems. This has led to their increased utilization in several research
areas, such as bacterial genome engineering, phage therapy, disease diagnostics, and
viral host identification. Phage particles are built from structural proteins called phage
virion proteins (PVPs). Classifying these proteins is important for genomic research,
which in turn helps us understand the complex interactions between phages and their
hosts and supports the development of antibacterial drugs. A growing number of
computational strategies are replacing tedious traditional procedures for annotating
phage protein sequences acquired through high-throughput sequencing. Among
these techniques, deep learning approaches demonstrate improved performance in
classification outcomes. Such approaches require specialized sequence encodings that
let the model capture the distinctive features of protein sequences. Numerous
encodings have been examined and assessed, and novel methods continue to emerge to
improve resource utilization and prediction accuracy.
The objective of our work, ProteoKnight, is to explore and develop a unique encoding
technique for phage proteins and demonstrate its effectiveness via classification. In
our work, we use the time-separated PVP dataset introduced in [47]. Furthermore,
this study addresses the lack of research on uncertainty analysis by quantifying
predictive uncertainty in binary PVP classification using the
Monte Carlo Dropout (MCD) method. The experimental findings demonstrate the
effectiveness of our strategy for binary classification, achieving a prediction accuracy
of 90.2%. However, the accuracy for multi-class classification remains suboptimal.
Furthermore, our uncertainty analysis reveals that the prediction confidence of our
proposed classification approach varies with protein class and sequence length.
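
To illustrate the idea behind Monte Carlo Dropout for a binary classifier, the following is a minimal sketch and not the ProteoKnight implementation. It assumes a toy one-hidden-layer network with random weights written in NumPy; the function name `mc_dropout_predict`, the layer sizes, the dropout rate, and the number of stochastic passes are all hypothetical choices made for the example. Dropout is kept active at prediction time, the forward pass is repeated, and the spread of the resulting probabilities serves as an uncertainty estimate.

```python
import numpy as np


def mc_dropout_predict(x, w1, b1, w2, b2, n_samples=100, drop_rate=0.5, seed=0):
    """Repeat a stochastic forward pass with dropout left on at prediction time.

    Returns the mean positive-class probability, its standard deviation across
    passes, and the binary predictive entropy of the mean, one value per input.
    """
    rng = np.random.default_rng(seed)
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer, shape (batch, hidden)
    probs = np.empty((n_samples, x.shape[0]))
    for t in range(n_samples):
        # Draw a fresh dropout mask for every pass (inverted dropout scaling).
        mask = rng.random(hidden.shape) >= drop_rate
        dropped = hidden * mask / (1.0 - drop_rate)
        logits = dropped @ w2 + b2  # shape (batch, 1)
        probs[t] = 1.0 / (1.0 + np.exp(-logits[:, 0]))  # sigmoid
    mean_p = probs.mean(axis=0)
    std_p = probs.std(axis=0)
    eps = 1e-12  # avoids log(0)
    entropy = -(mean_p * np.log(mean_p + eps)
                + (1.0 - mean_p) * np.log(1.0 - mean_p + eps))
    return mean_p, std_p, entropy


# Toy inputs and random weights (assumptions for illustration only).
demo_rng = np.random.default_rng(42)
x_demo = demo_rng.normal(size=(4, 8))           # 4 encoded sequences, 8 features
w1_demo = demo_rng.normal(size=(8, 16)) * 0.5
b1_demo = np.zeros(16)
w2_demo = demo_rng.normal(size=(16, 1)) * 0.5
b2_demo = np.zeros(1)

mean_p, std_p, entropy = mc_dropout_predict(
    x_demo, w1_demo, b1_demo, w2_demo, b2_demo, n_samples=200, drop_rate=0.5
)
for i in range(len(mean_p)):
    print(f"sample {i}: p={mean_p[i]:.3f}  std={std_p[i]:.3f}  entropy={entropy[i]:.3f}")
```

A larger standard deviation or entropy marks a prediction the model is less sure about. Grouping these values by class or by sequence length is one way such an analysis could be summarized.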