Comparison of k-mer DNA data representations for classification with neural networks


Bibliographic details
Date: 2024
Author: Terpilovskyi, Yehor
Format: Article
Language: Ukrainian
Published: V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, 2024
Online access: https://jais.net.ua/index.php/files/article/view/408
Journal title: Problems of Control and Informatics

Repositories

Problems of Control and Informatics
Description
Summary: Classifying DNA sequences as healthy or diseased is a crucial task in genomics, with significant implications for understanding genetic disorders and developing precision medicine. Neural networks have emerged as a powerful tool for this classification because they can model complex patterns in large datasets. A foundational step in this process is representing DNA sequences as sets of k-mers, which are subsequences of a fixed length k. This study evaluates and compares two methods for representing k-mer data. The first method employs a binary feature vector in which each possible k-mer corresponds to one binary feature. This representation, while straightforward, produces high-dimensional, sparse feature vectors, leading to substantial memory requirements and potential computational inefficiencies. The second method is based on the Conway–Bromage–Lyndon (CBL) structure, which provides a compressed and dynamic representation of k-mers. By leveraging smallest cyclic rotations, or necklaces, the CBL method reduces redundancy and optimizes data storage. We analyze these methods across three key metrics: memory usage, computational efficiency, and classification performance using neural networks. The CBL-based method consistently shows better memory efficiency, significantly reducing the memory footprint required to store k-mer features. It also generates feature vectors faster, addressing the computational challenges posed by the binary feature vector approach. In classification accuracy, the CBL-based method performs comparably, with slight improvements in some cases, which indicates that it captures meaningful sequence features effectively. Our findings underscore the advantages of the CBL-based k-mer representation, making it a promising alternative for large-scale genomic analyses where both memory and computational resources are critical constraints.
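
Illustrative note: the two ideas named in the summary can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the article's implementation and not the CBL data structure itself. The function names (`kmers`, `binary_vector`, `necklace`, `necklace_set`) are hypothetical, the alphabet is assumed to be exactly A, C, G, T, and "smallest cyclic rotation" is assumed to mean the lexicographically smallest one.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: input contains only these four bases


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def binary_vector(seq, k):
    """Dense presence/absence vector with one slot per possible k-mer (4**k slots)."""
    # Map every possible k-mer to a fixed position in lexicographic order.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0] * len(index)
    for km in kmers(seq, k):
        vec[index[km]] = 1
    return vec


def necklace(kmer):
    """Lexicographically smallest cyclic rotation of a k-mer (assumed ordering)."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def necklace_set(seq, k):
    """Set of necklace representatives of the k-mers in seq (illustration only)."""
    return {necklace(km) for km in kmers(seq, k)}
```

For the sequence `ACGACG` with k = 3, `binary_vector` allocates 64 slots and sets three of them (for ACG, CGA and GAC), while `necklace_set` maps all three k-mers to the single representative `ACG`. The sketch only shows why the dense vector grows as 4^k and how rotations of one k-mer share a representative; the compressed storage and dynamic updates described in the summary are not modeled here.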