Comparison of k-mer DNA data representations for classification with neural networks
| Date: | 2024 |
|---|---|
| Author: | |
| Format: | Article |
| Language: | Ukrainian |
| Published: | V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, 2024 |
| Subjects: | |
| Online access: | https://jais.net.ua/index.php/files/article/view/408 |
| Journal title: | Problems of Control and Informatics |
| Repository: | Problems of Control and Informatics |
| Abstract: | Classifying DNA sequences as healthy or diseased is a crucial task in genomics, with significant implications for understanding genetic disorders and developing precision medicine. Neural networks have emerged as a powerful tool for this classification due to their ability to model complex patterns in large datasets. A foundational step in this process involves representing DNA sequences as sets of k-mers, which are subsequences of a fixed length (k). This study evaluates and compares two methods for representing k-mer data. The first method employs a binary feature vector, where each possible k-mer corresponds to a binary feature. This representation, while straightforward, results in high-dimensional and sparse feature vectors, leading to substantial memory requirements and potential computational inefficiencies. The second method is based on the Conway–Bromage–Lyndon (CBL) structure, which introduces a compressed and dynamic representation of k-mers. By leveraging the smallest cyclic rotations, or necklaces, the CBL method reduces redundancy and optimizes data storage. We analyze these methods across three key metrics: memory usage, computational efficiency, and classification performance using neural networks. The CBL-based method consistently demonstrates superior memory efficiency by significantly reducing the memory footprint required to store k-mer features. It also achieves faster feature vector generation times, addressing the computational challenges posed by the binary feature vector approach. In terms of classification accuracy, the CBL-based method performs comparably, with slight improvements in some cases, highlighting its capacity to capture meaningful sequence features effectively. Our findings underscore the advantages of the CBL-based k-mer representation, making it a promising alternative for large-scale genomic analyses where both memory and computational resources are critical constraints. |
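
The two ideas named in the abstract, a binary presence vector over all possible k-mers and the smallest cyclic rotation (necklace) of a k-mer, can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not reproduce the CBL data structure itself. The function names (`kmers`, `binary_vector`, `necklace`, `necklace_set`), the four-letter alphabet, and the example sequence are assumptions chosen for illustration only.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; ambiguous bases are not handled


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def binary_vector(seq, k):
    """Dense 0/1 vector with one entry per possible k-mer (4**k entries),
    ordered lexicographically over ALPHABET."""
    present = kmers(seq, k)
    return [1 if "".join(p) in present else 0
            for p in product(ALPHABET, repeat=k)]


def necklace(kmer):
    """Lexicographically smallest cyclic rotation of kmer."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def necklace_set(seq, k):
    """Distinct necklaces of the k-mers in seq (illustration only;
    the actual CBL structure stores more than this set)."""
    return {necklace(x) for x in kmers(seq, k)}


if __name__ == "__main__":
    seq, k = "ACGTACGA", 3          # hypothetical toy sequence
    vec = binary_vector(seq, k)
    print(len(vec), sum(vec))       # vector length and number of set bits
    print(sorted(necklace_set(seq, k)))
```

The dense vector grows as 4^k regardless of how many k-mers actually occur, which is the sparsity problem the abstract describes. Grouping k-mers by their smallest rotation shrinks the set of distinct keys, since rotations such as `ACG` and `CGA` share one necklace.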