An autoencoder-based decentralized clustering leveraging model aggregation fusion strategy
Abstract
Unsupervised clustering plays a crucial role in various real-life applications. It works
by grouping similar data points together based on certain features or characteristics,
without the use of predefined labels. The process generally starts with gathering the data to be clustered into a centralized system. This data could be in the
form of numerical features, text, images, or any other type of information. The
exponential expansion of digital transformation, the Internet of Things (IoT), social
media, and online platforms has precipitated an unprecedented surge in data
generation. This proliferation is characterized by an incessant stream of information
flowing from various sources, encompassing user interactions, sensor readings,
online transactions, and more. This deluge of data poses both challenges and opportunities for businesses, governments, and individuals alike. Gathering, managing, and processing such volumes of data in a centralized system is time-consuming and difficult. Additionally, concerns related to data privacy, security, and ethical
considerations become more prominent as data volumes continue to grow. Moreover,
it is important to respect individuals’ privacy rights and to adhere to relevant data
protection laws and regulations. Federated learning addresses concerns about data
volume and privacy by leaving user data on devices. Federated unsupervised representation
learning is an architecture that pre-trains deep neural networks on unlabeled data in a federated fashion. In
centralized settings, model-based clustering approaches demonstrate significant effectiveness.
These methods rely on statistical models to identify underlying patterns
and group data points accordingly. By leveraging sophisticated algorithms, model-based clustering can efficiently handle complex data structures and accurately partition
datasets into meaningful clusters. This approach enables centralized systems
to efficiently organize and analyze large volumes of data, facilitating insights and
decision-making processes across various domains. Moreover, model-based clustering
offers flexibility in accommodating different data distributions and can adapt
to diverse clustering requirements, making it a versatile tool for centralized data
analysis tasks. In contrast to the centralized setup, model-based clustering in federated settings remains relatively unexplored, possibly because training models on highly heterogeneous client data with the FedAvg method is more difficult.
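For context, FedAvg simply replaces the global model with a data-size-weighted average of the locally trained client models, which is where client heterogeneity causes trouble. A minimal sketch of that aggregation step (illustrative names only, assuming parameters are held as NumPy arrays):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg: average client parameter lists, weighted by local dataset size.

    client_weights: list of parameter lists (one list of np.ndarray per client)
    client_sizes:   number of local samples on each client
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    aggregated = []
    for layer in range(num_layers):
        # Weighted sum of this layer's parameters across all clients
        layer_sum = sum(
            (n / total) * w[layer] for w, n in zip(client_weights, client_sizes)
        )
        aggregated.append(layer_sum)
    return aggregated
```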
The recently proposed Unsupervised Iterative Federated Clustering Algorithm (UIFCA) uses a normalizing flow model to perform clustering on unlabeled datasets in federated environments. UIFCA builds on the IFCA framework, which tackles the problem of highly heterogeneous settings. In this work, a novel approach for decentralized clustering is introduced, combining a proposed model parameter aggregation strategy, FednadamN, with an autoencoder as the deep generative model. FednadamN combines the benefits of two state-of-the-art optimization methods for federated learning, Adam and Nadam. Adam offers quick convergence and resilience to noisy data by using adaptive learning rates based on the first and second moments of the gradients. Nadam extends Adam with Nesterov accelerated gradients, further improving the stability and speed of convergence.
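To make the optimizer combination concrete, the standard Adam and Nadam update rules referenced here are reproduced below in their well-known forms; the precise way FednadamN fuses or applies them in the federated aggregation step is not spelled out in this abstract.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, &
\hat m_t &= \frac{m_t}{1-\beta_1^{\,t}}, \quad
\hat v_t = \frac{v_t}{1-\beta_2^{\,t}} \\[4pt]
\text{Adam:}\quad
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} \\[4pt]
\text{Nadam:}\quad
\theta_{t+1} &= \theta_t - \eta\, \frac{\beta_1 \hat m_t + \dfrac{(1-\beta_1)\, g_t}{1-\beta_1^{\,t}}}{\sqrt{\hat v_t}+\epsilon}
\end{aligned}
```

The Nadam update applies a Nesterov-style look-ahead to the first-moment estimate, which is the source of the stability and convergence-speed gains mentioned above.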
The proposed method addresses the challenge of clustering in decentralized settings by leveraging the collective intelligence of distributed nodes while preserving data privacy and minimizing communication overhead. By aggregating model parameters across the decentralized nodes and employing autoencoder-based representations, efficient clustering is enabled without the need for central data storage or coordination.
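A minimal sketch of this pipeline under stated assumptions: each node trains a small autoencoder on its private data, the server aggregates the parameters (shown here with a simple FedAvg-style weighted average as a placeholder for the FednadamN rule), and clustering is then run on the encoder's latent representations. All class, function, and parameter names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
from sklearn.cluster import MiniBatchKMeans

class TinyAutoencoder(nn.Module):
    """Illustrative autoencoder: the encoder output is the clustering representation."""
    def __init__(self, in_dim, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def local_train(model, data, epochs=1, lr=1e-3):
    """One client's local update: minimize reconstruction error on its own private data."""
    opt = torch.optim.NAdam(model.parameters(), lr=lr)  # Nadam-style local optimizer
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(data)
        loss_fn(recon, data).backward()
        opt.step()
    return [p.detach().clone() for p in model.parameters()]

def aggregate(param_lists, sizes):
    """Server step: size-weighted parameter average (placeholder for the FednadamN rule)."""
    total = float(sum(sizes))
    return [sum((n / total) * params[i] for params, n in zip(param_lists, sizes))
            for i in range(len(param_lists[0]))]

# One simulated federated run over a few clients, then clustering on latent codes.
clients = [torch.randn(100, 16) for _ in range(3)]        # each client's private data
global_model = TinyAutoencoder(in_dim=16)
for _ in range(5):                                         # federated rounds
    updates, sizes = [], []
    for data in clients:
        local = TinyAutoencoder(in_dim=16)
        local.load_state_dict(global_model.state_dict())
        updates.append(local_train(local, data))
        sizes.append(len(data))
    with torch.no_grad():
        for p, q in zip(global_model.parameters(), aggregate(updates, sizes)):
            p.copy_(q)

# Each node encodes its own data and clusters the representations locally.
with torch.no_grad():
    z = global_model.encoder(clients[0])
labels = MiniBatchKMeans(n_clusters=4, n_init=10).fit_predict(z.numpy())
```

In this sketch only model parameters ever leave a node; the raw data and the resulting cluster assignments stay local, which is the privacy property the approach relies on.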
This approach promises to enhance scalability, privacy, and performance in decentralized clustering tasks across various domains. Additionally, a comparison between the proposed approach and existing techniques on benchmark datasets is provided. Four benchmark datasets were used: image segmentation, protein localization, letter image recognition, and Deterding vowel recognition. The proposed technique for clustering
letter image recognition data produces the highest mutual information score of 1.192 and the highest V-measure score of 0.373 using the k-means algorithm; however, FedAvg with the fuzzy k-means algorithm yields the highest Rand index score of 0.925. For the Deterding vowel recognition data, the proposed approach achieves the highest V-measure score of 0.264 and the highest Rand index score of 0.850 using the k-means algorithm; however, it performs less well than FedAdam, which shows a V-measure score of 0.258 with the mini-batch k-means algorithm. For the protein localization data, the proposed approach yields the highest Rand score of 0.774, the highest mutual information score of 0.908, and the highest V-measure score of 0.527 using the mini-batch k-means algorithm. For the image segmentation data, the proposed method yields the highest mutual information score of 1.084, the highest Rand score of 0.849, and the highest V-measure score of 0.565 using the mini-batch k-means algorithm.
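The reported scores correspond to standard external clustering-evaluation metrics. A brief sketch of how such scores can be computed with scikit-learn, assuming ground-truth labels are used only for evaluation (whether the paper uses the raw or adjusted/normalized variants is not stated in this abstract):

```python
from sklearn.metrics import rand_score, mutual_info_score, v_measure_score

# true_labels: held-out ground truth used only for evaluation
# pred_labels: cluster assignments from k-means / mini-batch k-means / fuzzy k-means
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

print("Rand index:        ", rand_score(true_labels, pred_labels))
print("Mutual information:", mutual_info_score(true_labels, pred_labels))
print("V-measure:         ", v_measure_score(true_labels, pred_labels))
```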
These results demonstrate the improved performance of the proposed approach and its potential applicability to diverse clustering tasks. Its robustness and adaptability suggest broader relevance across multiple domains.