Extracting structure from text documents based on machine learning

This study is devoted to a method that facilitates extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and evaluating the results. Data preparation includes collecting corpora of documents...

Bibliographic Details
Published in:	Проблеми програмування
Date:	2022
Main Authors:	Kudim, K.A., Proskudina, G.Yu.
Format:	Article
Language:	English
Published:	Інститут програмних систем НАН України 2022
Subjects:	Models and tools of database and knowledge base systems
Online Access:	https://nasplib.isofts.kiev.ua/handle/123456789/188639
Institution:	Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite this:	Extracting structure from text documents based on machine learning / K.A. Kudim, G.Yu. Proskudina // Проблеми програмування. — 2022. — № 3-4. — С. 154-160. — Бібліогр.: 5 назв. — англ.

description	This study is devoted to a method that facilitates extracting structure from text documents using an artificial neural network. The method consists of data preparation, building and training the model, and evaluating the results. Data preparation includes collecting corpora of documents, converting a variety of file formats into plain text, and manually labeling the structure of each document. The documents are then split into tokens and paragraphs. Each text paragraph is represented as a feature vector that serves as input to the neural network. The model is trained and validated on selected subsets of the data. An evaluation of the trained model is presented. The final performance is calculated per label using precision, recall, and F1 measures, along with an overall average. The trained model can be used to extract sections of documents that have a similar structure. Ukrainian abstract, in translation: The study is devoted to a method that solves the problem of automatically extracting structure from weakly structured text documents using an artificial neural network. The method consists of data preparation, building and training the model, and evaluating the results. Data preparation includes collecting document corpora, converting various file formats into plain text, and manually labeling the structure of each document. The documents are then split into words and paragraphs. Text paragraphs are represented as feature vectors that provide the input to the neural network. The model is trained and validated on selected data subsets. An evaluation of the trained model's results is presented. Final performance is calculated for each label using the F1 score, precision, and recall, as well as an overall average. The trained model can be used to extract sections of documents that have a similar structure.
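The per-label evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the label names (`title`, `author`, `abstract`, `body`, `references`), the sample data, and the helper functions `per_label_scores` and `macro_average` are hypothetical, and the "overall average" is assumed here to be an unweighted (macro) average, which the record does not specify.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for paragraph-level labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # paragraph labeled correctly
        else:
            fp[p] += 1          # predicted label was wrong
            fn[g] += 1          # true label was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over all labels (an assumption)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical gold and predicted labels, one per paragraph.
gold = ["title", "author", "abstract", "body", "body", "references"]
predicted = ["title", "abstract", "abstract", "body", "body", "body"]

scores = per_label_scores(gold, predicted)
average = macro_average(scores)
for label, (p, r, f1) in scores.items():
    print(f"{label:<12} P={p:.2f} R={r:.2f} F1={f1:.2f}")
print("macro avg    P={:.2f} R={:.2f} F1={:.2f}".format(*average))
```

A label that is never predicted correctly (such as `author` in this sample) gets zero for all three measures, which pulls the macro average down regardless of how many paragraphs carry that label.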
issn	1727-4907
doi	https://doi.org/10.15407/pp2022.03-04.154
classification	004.82
title_alt	Витяг структури з текстових документів на основі машинного навчання