Inference of gene regulatory network (GRN) from gene expression data using k-means clustering and entropy-based selection of interactions
Abstract
Inferring a regulatory network from gene expression data alone is considered a challenging task in systems biology, and the introduction of high-throughput DNA microarray technologies for collecting expression data has significantly increased the amount of data that existing algorithms must analyze. These algorithms each focus on different issues in the inference of gene regulatory networks (GRNs), and their methodologies work well only for certain types of datasets and/or regulatory networks. As a result, they have inherent limitations in dealing with different types of datasets. In this paper, we propose a novel method to infer a gene regulatory network from expression data that combines k-means clustering with properties of entropy from information theory. The proposed method has two main components: first, grouping the genes of a dataset into a given number of clusters, and then finding statistically significant interactions among the genes of each individual cluster and selected nearby clusters. For the second component, an information-theoretic approach based on entropy reduction is used to generate a regulatory interaction matrix covering all genes. Grouping genes into clusters based on the similarity of their expression levels reduces the search space of regulatory interactions among genes. The Entropy Reduction Technique (ERT) then finds regulatory interactions within this reduced set of genes. To assess the performance of our algorithm, we used datasets from the DREAM5 – Network Inference challenge [6], the DREAM4 – In Silico Network challenge [7], and one in silico dataset generated by GeneNetWeaver [8]. We compared the performance of our algorithm with that of ARACNE, a popular information-theoretic approach for reverse engineering gene regulatory networks from expression data, using precision and recall as performance measures.
Our algorithm showed significant improvement in precision and recall over the network generated by ARACNE. We also compared our results across different threshold values and different numbers of clusters for three versions of our algorithm: No Clustering, Unmerged Clustering, and Selected Merged Clustering.
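
As a rough illustration of the entropy-based selection and the evaluation measures mentioned above, the following Python sketch scores gene pairs by how much knowing one gene's discretized expression reduces the entropy of another, keeps pairs above a threshold, and computes precision and recall against a reference edge set. It is a minimal sketch under stated assumptions, not the ERT as specified in the paper: the normalized score `(H(X) - H(X|Y)) / H(X)`, the undirected edges, the pre-discretized profiles, and all function names are hypothetical choices made for illustration. The clustering step is omitted; in a clustered variant, `infer_edges` would simply be applied to the genes of each cluster.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) = H(X, Y) - H(Y) for two equal-length discrete sequences."""
    return entropy(list(zip(x, y))) - entropy(y)


def entropy_reduction(x, y):
    """Fraction of X's entropy removed by knowing Y (assumed scoring rule)."""
    hx = entropy(x)
    if hx == 0:
        return 0.0  # a constant profile has no entropy to reduce
    return (hx - conditional_entropy(x, y)) / hx


def infer_edges(profiles, threshold):
    """Return undirected gene pairs whose score reaches the threshold.

    `profiles` maps a gene name to its discretized expression profile.
    A pair is kept if the reduction is large enough in either direction.
    """
    genes = sorted(profiles)
    edges = set()
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            score = max(
                entropy_reduction(profiles[a], profiles[b]),
                entropy_reduction(profiles[b], profiles[a]),
            )
            if score >= threshold:
                edges.add((a, b))
    return edges


def precision_recall(predicted, gold):
    """Precision and recall of a predicted edge set against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy, hand-made binary profiles (not data from the paper).
profiles = {
    "g1": [0, 0, 1, 1, 0, 1, 0, 1],
    "g2": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to g1
    "g3": [0, 1, 0, 1, 0, 0, 1, 1],  # statistically independent of g1
}
gold = {("g1", "g2"), ("g2", "g3")}
predicted = infer_edges(profiles, threshold=0.5)
precision, recall = precision_recall(predicted, gold)
```

Raising the threshold in this sketch discards weaker pairs, which tends to trade recall for precision; that is the kind of trade-off a comparison across threshold values would expose.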