Automated extraction of structured information from a variety of web pages

Bibliographic Details
Main Authors: Pogorilyy, S.D., Kramov, A.A.
Format: Article
Language: Ukrainian
Published: Інститут програмних систем НАН України 2018
Subjects:
Online Access: https://pp.isofts.kiev.ua/index.php/ojs1/article/view/277
Description
Summary: The paper substantiates the expediency of using structured data extraction methods on sets of HTML pages for information search on the Internet. It analyzes the main methods of extracting structured data from sets of web pages that are generated by a common script with different data. The methods are classified by the degree of automation of the template formation process (the degree of user involvement). The operating principles of the main unsupervised methods (RoadRunner, FiVaTech, Trinity) are described in detail, and the advantages and disadvantages of each are shown. The expediency of using the Trinity method for data extraction, in comparison with the other methods, is substantiated. The problem of choosing which documents from a set of HTML pages to use as input for generating a common template is considered. The Trinity method is verified experimentally on a set of HTML pages that represent articles of Ukrainian scientific journals. The test set of HTML pages is created by an automated crawl of the web sites; the search bot is implemented by processing the object model of the HTML documents obtained from those sites. The templates (regular expressions) formed by the Trinity method are applied to the entire set of input HTML pages. The extraction results (structured data about the articles) are exported to a database for further analysis and compared with article data obtained by manual analysis of the object model of the web pages. The error of the Trinity method on the experimental set of HTML pages is calculated.
Problems in programming 2018; 2-3: 149-158
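
The evaluation step described in the summary (apply a regular-expression template to each page, then compare the result with manually obtained data) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the template is hand-written rather than generated by Trinity, the HTML markup and field names are hypothetical, and the error is taken to be the fraction of pages whose extracted record differs from the manual one, which may not match the metric used in the paper.

```python
import re

# Hypothetical template. In the paper, templates are generated by the Trinity
# method; this hand-written pattern only stands in for one.
TEMPLATE = re.compile(
    r'<h1 class="title">(?P<title>.*?)</h1>\s*'
    r'<div class="authors">(?P<authors>.*?)</div>',
    re.DOTALL,
)


def extract(page):
    """Apply the template to one HTML page; return a dict of fields or None."""
    match = TEMPLATE.search(page)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def error_rate(pages, reference):
    """Fraction of pages whose extracted record differs from the manual one.

    This definition of the error is an assumption, not taken from the paper.
    """
    if not pages:
        raise ValueError("no pages to evaluate")
    wrong = sum(
        1 for page, expected in zip(pages, reference) if extract(page) != expected
    )
    return wrong / len(pages)


# Made-up sample pages and manually extracted reference records.
pages = [
    '<h1 class="title">Article A</h1>\n<div class="authors">Author One</div>',
    '<h1 class="title">Article B</h1>\n<p>layout differs here</p>',
]
reference = [
    {"title": "Article A", "authors": "Author One"},
    {"title": "Article B", "authors": "Author Two"},
]

print(error_rate(pages, reference))  # 0.5
```

The second sample page does not follow the template, so no record is extracted from it and it counts as an error.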