Scalable clustering of text data based on word embeddings and noise analysis

Bibliographic Details
Date:	2026
Main Authors:	Shutiak, Dmytro, Podkolzin, Gleb, Pokhylenko, Oleksandr
Format:	Article
Language:	English
Published:	The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2026
Subjects:	text clustering, word embeddings, large language models, machine learning, Python
Online Access:	https://journal.iasa.kpi.ua/article/view/365268
Journal Title:	System research and information technologies

Description
Summary:	Text clustering is a key component of analyzing unstructured text messages. Before clustering methods can be applied, the text must be converted into vector representations (embeddings). This paper presents a modification of the HDBSCAN* clustering algorithm that uses custom distance metrics from the Minkowski family (L1, L2, L∞) and parameters tailored to clustering unstructured text data. A major contribution is a novel evaluation metric based on the relative point density of the identified clusters and the surrounding noise formations (“clouds”). Beyond assessing overall clustering quality, this metric highlights problematic dense accumulations within the noise that require additional manual analysis. Experiments on the “20 Newsgroups” dataset showed that clustering quality is independent of the α parameter but highly sensitive to the distance metric, with L∞ yielding the best results. The nomic-embedding-v1 model significantly outperformed gte-v1.5 on both the silhouette score and the proposed relative density metric.
DOI:	10.20535/SRIT.2308-8893.2026.2.10
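
Illustration:	The three Minkowski-family metrics named in the summary differ only in how per-coordinate differences are aggregated. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `minkowski_distance` and the sample vectors are hypothetical and chosen only for illustration.

```python
import math


def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives L1 (Manhattan), p = 2 gives L2 (Euclidean),
    and p = math.inf gives L-infinity (Chebyshev).
    Hypothetical helper for illustration only.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # L-infinity keeps only the largest coordinate difference.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Toy vectors standing in for two embedding vectors (assumed values).
u = [0.0, 3.0, 1.0]
v = [4.0, 0.0, 1.0]

l1 = minkowski_distance(u, v, 1)
l2 = minkowski_distance(u, v, 2)
linf = minkowski_distance(u, v, math.inf)
print(l1, l2, linf)
```

For any pair of vectors the values satisfy L∞ ≤ L2 ≤ L1, so the choice of metric changes how close points appear to one another. This is one plausible reason a density-based method would be sensitive to the metric.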