Automated extraction of structured information from a variety of web pages

Bibliographic details
Date: 2018
Authors: Pogorilyy, S.D., Kramov, A.A.
Format: Article
Language: Ukrainian
Published: Institute of Software Systems of the NAS of Ukraine, 2018
Online access: https://pp.isofts.kiev.ua/index.php/ojs1/article/view/277
Journal title: Problems in programming

Repositories

Problems in programming
Description
Abstract: The use of structured data extraction methods on sets of HTML pages for information search on the Internet is justified. The main methods for extracting structured data from sets of web pages that are generated by a common script from different data sets are analyzed. A classification of the methods by the degree of automation of template generation (the extent of user involvement) is considered. The operating principles of the main unsupervised methods (Roadrunner, FiVaTech, Trinity) are described in detail, and their advantages and disadvantages are shown. The preference for the Trinity method over the other methods for data extraction is justified. The problem of choosing, from a set of HTML pages, the input documents from which the method generates a common template is considered. The Trinity method is verified experimentally on a set of HTML pages representing articles of Ukrainian scientific journals. The test set of HTML pages is created by an automated crawl of a web site; the search bot is implemented by processing the object model of the HTML documents obtained from web sites. The templates (regular expressions) generated by the Trinity method are applied to the entire set of input HTML pages. The extraction results (structured data about the articles) are exported to a database for further analysis. The results are compared with article data obtained by manual analysis of the object model of the web pages, and the error of the Trinity method on the experimental set of HTML pages is calculated.
Problems in programming 2018; 2-3: 149-158