Scalable clustering of text data based on word embeddings and noise analysis
| Date: | 2026 |
|---|---|
| Main Authors: | , , |
| Format: | Article |
| Language: | English |
| Published: | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", 2026 |
| Subjects: | |
| Online Access: | https://journal.iasa.kpi.ua/article/view/365268 |
| Journal Title: | System research and information technologies |
| Institution: | System research and information technologies |
| Summary: | Text data clustering is a key component of unstructured text message analysis. To apply clustering methods, text data must first be converted into vector representations, i.e., word embeddings must be computed. This paper presents a modification of the HDBSCAN* clustering algorithm that uses custom distance metrics from the Minkowski family (L1, L2, L∞) and parameters specifically tailored for clustering unstructured text data. A major contribution is a novel evaluation metric based on the relative point density of identified clusters and the surrounding noise formations (“clouds”). Beyond assessing overall clustering quality, this metric highlights problematic dense accumulations within the noise that require additional manual analysis. Experimental evaluation on the “20 Newsgroups” dataset showed that clustering quality is independent of the α parameter but highly sensitive to the distance metric, with L∞ yielding the best results. The nomic-embedding-v1 model significantly outperformed gte-v1.5 in both the silhouette score and the proposed relative density metric. |
| DOI: | 10.20535/SRIT.2308-8893.2026.2.10 |
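
The summary compares three members of the Minkowski family of distances (L1, L2, L∞). As a minimal illustration of how these differ on the same pair of vectors, the sketch below computes all three in plain Python. The function name `minkowski_distance` and the sample vectors are hypothetical and chosen only for this example; this is not the authors' implementation and does not reproduce their HDBSCAN* modification.

```python
import math


def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives the Manhattan (L1) distance, p = 2 the Euclidean (L2)
    distance, and p = math.inf the Chebyshev (L-infinity) distance.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # Limit case: the largest coordinate-wise difference dominates.
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Hypothetical 3-dimensional "embeddings", for illustration only.
u = [0.0, 3.0, 1.0]
v = [4.0, 0.0, 1.0]

l1 = minkowski_distance(u, v, 1)           # sum of absolute differences
l2 = minkowski_distance(u, v, 2)           # straight-line distance
linf = minkowski_distance(u, v, math.inf)  # largest single difference
```

For any pair of vectors the three values satisfy L∞ ≤ L2 ≤ L1, so the choice of metric changes how strongly a single large coordinate difference influences the neighbourhood structure that a density-based clusterer sees.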