Towards Automated Machine Learning: Hyperparameter Optimization in Online Clustering
Date
2023
Publisher
Tartu Ülikool
Abstract
Machine Learning (ML) has demonstrated significant potential in data-driven applications,
particularly in real-time use cases through online ML, which processes data streams
and handles concept drift (changes in data distribution) dynamically. Automated ML
(AutoML) seeks to streamline ML pipeline tasks like hyperparameter optimization (HPO)
and model selection for improved performance. While some efforts have been made
to integrate online ML and AutoML, research on automated online clustering remains
limited. This thesis develops an HPO solution for online clustering. The aim was to
propose an ensemble-based approach that leverages more than
one internal clustering validation index (CVI) to address the evaluation problem in online
clustering. HPO was implemented on top of the river framework. To evaluate HPO in
online clustering, two online clustering algorithms were run on six synthetic
datasets with ground-truth labels. During HPO, models were separately
optimized towards two internal CVIs, the Silhouette score and the Calinski-Harabasz
Index, and models were compared by using an external CVI, the Adjusted Rand Index. In
the experiments, (a) online clustering algorithms with default parameters, (b) the
best optimized online clustering algorithms, and (c) the ensemble of the best optimized
models were compared. The findings revealed that the efficacy of HPO depends on the
type of data. On k-centroid-based datasets, the Silhouette-optimized model and the
ensemble model outperformed the other clustering solutions, whereas neither HPO nor
ensembling yielded superior results on S-curve datasets.
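
The evaluation protocol described in the abstract (optimize towards an internal CVI, then compare against ground truth with an external CVI) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration rather than the thesis implementation: a hand-written sequential k-means stands in for river's online clustering algorithms, a plain grid search over k stands in for hyperopt, the toy stream has three well-separated blobs, and the helper names (online_kmeans, silhouette, adjusted_rand_index) are hypothetical. The ensemble step is not shown.

```python
import math
import random
from collections import Counter


def online_kmeans(stream, k):
    """Sequential (MacQueen-style) k-means: one pass, one point at a time."""
    centroids = [tuple(p) for p in stream[:k]]  # seed with the first k points
    counts = [0] * k
    labels = []
    for p in stream:
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[j] += 1
        # Incremental mean update of the winning centroid.
        centroids[j] = tuple(m + (x - m) / counts[j] for m, x in zip(centroids[j], p))
        labels.append(j)
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient (internal CVI, needs no ground truth)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; treated as worst score in this sketch
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index (external CVI, compares against ground-truth labels)."""
    def pairs(counts):
        return sum(math.comb(n, 2) for n in counts)
    index = pairs(Counter(zip(truth, pred)).values())
    sum_a = pairs(Counter(truth).values())
    sum_b = pairs(Counter(pred).values())
    expected = sum_a * sum_b / math.comb(len(truth), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (index - expected) / (max_index - expected)


# Toy stream (assumption): three Gaussian blobs, visited round-robin.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
stream, truth = [], []
for _ in range(20):
    for label, (cx, cy) in enumerate(centers):
        stream.append((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)))
        truth.append(label)

# "HPO": pick the k that maximizes the internal CVI; report the external CVI too.
results = {}
for k in (2, 3, 4):
    labels = online_kmeans(stream, k)
    results[k] = (silhouette(stream, labels), adjusted_rand_index(truth, labels))

best_k = max(results, key=lambda k: results[k][0])
print(best_k, results[best_k])
```

Only the silhouette score drives the selection of k; the Adjusted Rand Index is computed afterwards and never influences the search, which mirrors the separation between internal and external CVIs described above.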
Keywords
AutoML, online ML, online clustering, hyperopt, river