The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier
Описані механізми навчання та оцінки якості роботи класифікатора в розроблюваній системі автоматизованої обробки великих об'ємів текстової інформації. Класифікатор базується на вільній бібліотеці LibSVM та методі опорних векторів. Система виконує функції пошуку, класифікації, рубрикації та клас...
Gespeichert in:
| Veröffentlicht in: | Математичні машини і системи |
|---|---|
| Datum: | 2014 |
| Hauptverfasser: | , |
| Format: | Artikel |
| Sprache: | Englisch |
| Veröffentlicht: |
Інститут проблем математичних машин і систем НАН України
2014
|
| Schlagworte: | |
| Online Zugang: | https://nasplib.isofts.kiev.ua/handle/123456789/84448 |
| Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
| Назва журналу: | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
| Zitieren: | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier / V.V. Lytvynov, O.P. Moyseenko // Математичні машини і системи. — 2014. — № 4. — 53-59. — Бібліогр.: 14 назв. — англ. |
Institution
Digital Library of Periodicals of National Academy of Sciences of Ukraine| _version_ | 1859641604775608320 |
|---|---|
| author | Lytvynov, V.V. Moyseenko, O.P. |
| author_facet | Lytvynov, V.V. Moyseenko, O.P. |
| citation_txt | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier / V.V. Lytvynov, O.P. Moyseenko // Математичні машини і системи. — 2014. — № 4. — 53-59. — Бібліогр.: 14 назв. — англ. |
| collection | DSpace DC |
| container_title | Математичні машини і системи |
| description | Описані механізми навчання та оцінки якості роботи класифікатора в розроблюваній системі автоматизованої обробки великих об'ємів текстової інформації. Класифікатор базується на вільній бібліотеці LibSVM та методі опорних векторів. Система виконує функції пошуку, класифікації, рубрикації та кластеризації текстових документів за запитами користувача.
Описываются механизмы обучения и оценки качества работы классификатора в разрабатываемой системе автоматизированной обработки больших объемов текстовой информации. Классификатор базируется на свободной программной библиотеке LibSVM и методе опорных векторов. Система выполняет функции поиска, классификации, рубрикации и кластеризации текстовых документов по запросам пользователя.
The mechanisms of teaching and evaluation of the performance of the classifier in the developing system of the automated processing of large volumes of textual information are described. The classifier is based on the free software library LibSVM and support vector machines. The system performs the functions of search, classification, categorization and clustering of text documents at the request of the user
|
| first_indexed | 2025-12-07T13:22:43Z |
| format | Article |
| fulltext |
© Lytvynov V.V., Moyseenko O.P., 2014 53
ISSN 1028-9763. Математичні машини і системи, 2014, № 4
УДК 004.912: 004.632
V.V. LYTVYNOV*, O.P. MOYSEENKO*
THE MECHANISMS OF TEACHING AND EVALUATION OF THE QUALITY OF
PERFORMANCE OF THE TEXT DOCUMENTS CLASSIFIER
*Chernihiv National University of Technology, Chernihiv, Ukraine
Анотація. Описані механізми навчання та оцінки якості роботи класифікатора в розроблюваній
системі автоматизованої обробки великих об'ємів текстової інформації. Класифікатор базуєть-
ся на вільній бібліотеці LibSVM та методі опорних векторів. Система виконує функції пошуку,
класифікації, рубрикації та кластеризації текстових документів за запитами користувача.
Ключові слова: класифікація, рубрикація, кластеризація, обробка текстових документів.
Аннотация. Описываются механизмы обучения и оценки качества работы классификатора в
разрабатываемой системе автоматизированной обработки больших объемов текстовой инфор-
мации. Классификатор базируется на свободной программной библиотеке LibSVM и методе опор-
ных векторов. Система выполняет функции поиска, классификации, рубрикации и кластеризации
текстовых документов по запросам пользователя.
Ключевые слова: классификация, рубрикация, кластеризация, обработка текстовых документов.
Abstract. The mechanisms of teaching and evaluation of the performance of the classifier in the develop-
ing system of the automated processing of large volumes of textual information are described. The classi-
fier is based on the free software library LibSVM and support vector machines. The system performs the
functions of search, classification, categorization and clustering of text documents at the request of the
user.
Keywords: classification, categorization, clusterization, processing of text documents.
1. Introduction
The aim of classification (thematic categorization) of electronic natural language documents, i.e.
classification of the texts content to one or several thematic sections, is currently important due to
the continuous growth of stored or transmitted text data.
In theory, the solution of the documents classification task involves the presence of a cer-
tain plurality of electronic documents D={di}, that has to be separated into several nonintersect-
ing, thematically homogeneous subset (classes, С) and defining to which class each document
from the total mass of documents to be processed should be classified [1].
{C } 0( )
i
i
i i jd CC d D C C i j∀
∈= = × ∩ = ≠∪ . (1)
2. Problem statement
The objects of the research are:
• a relatively large text collection of several hundred documents, previously separated by
content into thematic groups (classes/sub-collections);
• the mechanisms of text data analysis in the system of natural language documents
processing.
The tools are:
• the developed system of processing of multilingual, dynamic flows of text data on the
base of support vector machine algorithm (SVM) [2], implemented in the free library LibSVM;
• implementation of SVM in the module Machine Learning (Support Vectors), the product
of the company StatSoft, STATISTIKA 8.0.
54 ISSN 1028-9763. Математичні машини і системи, 2014, № 4
b
w
0≤ξ<1
ξ>1
k(x,x')=
band width =
2 w wi
Support vector Support vector
not classified document
Fig. 1. The general scheme of work of SVM classifier where,
dots are the vector representation of two thematically different
subsets, pluralities, classified NL documents; k – some function
of the nucleus, that allows to separate thematic classes so that a
separating plane could be drawn; w – support vectors on the
base of bordering documents; ξ – the introduced variable error
to assess the classifier; b – the distance between the separating
pluralities plane and the beginning of coordinates; w – support
vector
As a result of theoretical and practical experiments, it will be possible to investigate more
thoroughly the processes of study and testing of the classifier in the system of “Processing of
high-speed information flows of text data”.
3. Problem solution
By the example of the method of support vector machines (fig. 1), the model of the text docu-
ments classifier can be presented as:
, , , ,R D C F Rc f=< > , (2)
where D – plurality of documents that need to be classified;
C – plurality of thematic rubrics (classes) { }C ci= , 1.. ,i Nc N= – number of possible
rubrics;
F – plurality of rubrics descriptions. Each class Ci has its distinctive description Fi;
Rc – ratio C F× , to check the single description of each rubric.
! : ( , )i i i i cC F F c F Rc ∈ ∃ ∈ ∈∀ ;
f – function | | 1: (d) Cd tC Cd D f ⊂ ∩ >∃ ∈ = , i.e. the process of classification of objects
d Є D in the result of which the
correspondence of a specific doc-
ument d to one of the descriptions
Fi and its assignment to the rubric
Сi are defined. According to this
function, elements of the plurality
of documents can be assigned to
several thematic rubrics at the
same time. To minimize the num-
ber of such cases the classifier has
to be properly taught before usage.
The popular in text data
classification tasks collection of
English short financial and stock
documents Reuters-21578 [3] of
the eponymous information agency
has been used in the research. As it
is seen from the name, this collec-
tion consists of 21 578 documents.
Some of the documents are marked
as not properly categorized, that is
why only 12902 documents are
used in practice. The corpus of
texts is presented in the form of
both txt and xml files. The collec-
tion is a part of the first volume of categorized documents of the information agency Reuters that
is abbreviated as RCV1 (Reuters Corpus Volume 1) [4]. In its original form the set of text docu-
ments of Reuters-21578 includes 135 thematic rubrics, 56 names of organizations, 267 different
personalities and 175 geographical names. The documents are collected in 21 xml-files and are
presented in the following way:
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-
SET" OLDID="5545" NEWID="2">
ISSN 1028-9763. Математичні машини і системи, 2014, № 4 55
<DATE> date of publication </DATE>
<TOPICS/>
<PLACES>
<D> location </D>
</PLACES>
<PEOPLE/>
<ORGS/>
<EXCHANGES/>
<COMPANIES/>
<UNKNOWN>more information </UNKNOWN>
<TEXT>
<TITLE> topic </TITLE>
<DATELINE> origin </DATELINE>
<BODY>text</BODY>
</TEXT>
</REUTERS>;
For teaching and evaluation of the quality of performance of SVM classifier the method
ModApte split has been used, that involves the separation of documents plurality of Reuters-
21578 collection into the subset for teaching – 9603 documents (74% of the total amount) and the
subset for testing of the chosen method of machine teaching with 3299 documents (26% off the
total amount). ModApte separation is recommended to use to compare results of work of several
classifiers.
4. Experimental part
The developing system is based on one of the variety of existing implementations of the support
vector machines method, namely the free library with nonlinear nucleuses – LibSVM [8, 9]. Pre-
ference to this library to the library of the same developers LibLinear, that is implementing a
quick linear classifier SVM, was given due to the work with small text corpuses and the possibili-
ty of occurrence of the situation of linear inseparability after a change of documents collections
by including documents in other NL. The mechanisms of SVM algorithm implementation in the
program product Statistika are not known. In the available software version there is one module
that implements this method for the tasks of classification and categorization for any text corpus-
es.
The quality of the classifier work depends on the correct presentation of processed docu-
ments in the form of a vector model [10, 11]. Each document from the collection of such model is
presented as a plurality of terms (words, word combinations, numbers and other elements of
which a document consists). According to the mentioned laws of Zipf, a certain weight can be
specified to the terms from the collection, i.e. how important this term is for the document cha-
racteristic. For the presentation of a document in the vector space, the weights of all terms of the
collection in regard to this document are denoted. The dimension of the document vector will be
equal to the total amount of all terms outlined from the collection.
),...,,( 21 njjjj wwwd = , (3)
where jd – vector presentation of j document, ijw – weight of i term in j document, n – total
amount of terms.
Thanks to such presentation of documents they can be compared by finding the distance
between vectors of the space (Euclidean distance or Mahalanobis distance). The smaller the dis-
tance is, the greater probability of thematic similarity between the documents.
In the system of automatic processing of text data flows on the base of LibSVM library
the following functions of nucleus are possible that implement the linear separation of classified
subjects:
56 ISSN 1028-9763. Математичні машини і системи, 2014, № 4
<Label> – nucleus identifier. Examples of functions:
0 – linear )(),( >⋅<= wxsignwxk .
1 – polynomial dxxxxk )(),( ′⋅=′ .
2 – radial basis function, )||||exp(),( 2xxxxk ′−−=′ γ , for 0>γ .
3 – sigmoid )(),( cxkxthxxk +′⋅=′ for 0,0 <> ck б
where К – nucleus function, x·x' – scalar product of vectors, у – mapping of a vector from the
space of features Rn into another space, d – degree, к and с – parameters, w – weights of features.
<Index1>:<Value1>
<Index2>:<Value>
…
Index – number of the vector coordinates, Value – value of the vector.
There are several standard ways of weight determination of a term in a document:
а) Boolean weight – 1, if the term is in the document, and 0 if it doesn’t occur;
b) Term Frequency (TF) – the frequency of the term occurrence in the document;
c) Term Frequency – Inverse Document Frequency (TF-IDF) – the frequency of the term
occurrence in the document at the amount that is inverse to the number of documents in which
this term occurs;
d) Pointwise Mutual Information (PMI) – all negative weights are replaced by zero.
For cleanliness of the experiment in the developed system the tf-idf method of determin-
ing terms weights is used as it is used in the software Statistika [8]:
∑
≠
+
⋅+
=
ks
is
k
ik
ki
N
N
D
N
w
2)1)(log(
))
||
log()log(1(
, (4)
where
kiN – number of occurrence of k term in the i document,
kN – number of occurrence of k
term in all documents, D – number of documents in the collection.
Taking into account the possibility of cases of linear inseparability of classified objects in-
to the equation describing the hyperplane that separates classes of documents in the space D, the
variable error is introduced 0iξ ≥ .
( ) 1i i iy d bω ξ− > −i , (5)
where iy – number equal to 1 in the case the vector di refers to the rubric we are interested in and
-1 if it doesn’t;
w – support vector;
b – boundary value of the distance between the separating hyperplane and the beginning
of the coordinates;
w·di > b ⇒ yi = 1;
w·di < b ⇒ di = −1.
It is supposed that if 0=iξ , there is no error in the document di. If 1>iξ , there is an error
in the document. If 10 << iξ , the object di falls within the band of the separating plane.
The task of the classifier teaching is to solve the issue of optimisation of the function
separating plane using the method of Lagrange [13]:
if at the point x relative minimum of the original objective function is achieved, then un-
der condition there is the equality 0 derivatives with respect to x of the new objective function,
ISSN 1028-9763. Математичні машини і системи, 2014, № 4 57
there exists a set iλ , that at the same point x the minimum of the new objective function is at-
tained, but globally for all x. At that for each iλ the following is true:
either iλ is equal to 0 and the corresponding constraint is not active, or iλ is not equal to
0 and the corresponding constraint is satisfied, but then this is already the equation.
Formulating this task in terms of Lagrange method, it turns out that it is necessary to find
the minimum of ξ,, bw and the maximum of iλ of the function:
1
( ( ) 1) при 0, 0
2 i i i i i i i
i i
С y d bω ω ξ λ ξ ω ξ λ+ − + − − ≥ ≥∑ ∑i i . (6)
If 0>iλ , then the document of the teaching collection id is called the support vector.
After these manipulations the optimized separating hyperplane equation looks as follows:
0i i i
i
y d d bλ − =∑ i , (7)
where id – document to be categorized.
As a numerical evaluation of the classification by both systems, the traditional set of met-
rics for a given issue was used: Accuracy (A), Precision (P) and Recall (R).
The first metric shows the general picture of the classifier performance, calculating the ra-
tio of documents properly distributed by the classifier to total.
%100⋅=
N
M
A , (8)
where М – the amount of correctly classified documents, N – the total amount of documents.
The metric of precision indicates the relation of correctly classified documents to a partic-
ular class and of all documents referred to this class.
%100⋅
+
=
FPTP
TP
P . (9)
The metric of recall is the relation of correctly classified documents to a particular class
and all documents belonging to this class in the test sample.
%100⋅
+
=
FNTP
TP
R . (10)
The formulas of recall and precision metrics are constructed on the basis of contingency
tables compiled for each of the possible classes.
Table 1. Variant of the classifier evaluation
Evaluation of the re-
sults by the classifier
Evaluation of the classification results by an expert
True False
True TP (true-positive) FP (false-positive)
False FN (false-negative) TN (true-negative)
The calculation of recall and precision is conducted separately, not joining them in the
popular metric of F-measure (11), which shows generalized assessment of the classifier perfor-
mance.
58 ISSN 1028-9763. Математичні машини і системи, 2014, № 4
%1002 ⋅
+
⋅⋅=
RP
RP
F . (11)
5. Results
After teaching and test categorization the classifiers of the tested systems showed the following
results.
For the texts corpus of 3299 documents from Reuters-21578 collection the developed sys-
tem based on the free library LibSVM and program product Statistica has given the evaluation.
Table 2. Results of the evaluation
System Accuracy, % Precision, % Recall, %
developed 93 80 94
Statistika 89 75 75
In the table there are average values of the metrics for the developed system with step-by-
step application of nucleus functions mentioned previously and realized in the library LibSVM.
The classifier based on the support vector machines algorithm, implemented in the product of
StatSoft company allows automatically determine the most suitable nucleus function for classifi-
cation of concrete objects, thus the figures obtained are considered as average and optimal for this
classifier.
6. Conclusions
The classifier of the developed system of processing text data flows on the base of free library
LibSVM has shown better results in comparison to the module Machine Learning (Support Vec-
tors) of the system Statistika. This may be caused by both: difference of approaches to texts
processing (markup, normalization) and choice of the nucleus function. It is planned to improve
the classifier performance evaluation on mixed collections.
REFERENCES
1. Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Fea-
tures / T. Joachims // Proc. of ECML-98, 10th European Conference on Machine Learning. – Dortmund,
1998. – P. 137 – 142.
2. Вапник В.Н. Восстановление зависимостей по эмпирическим данным / Вапник В.Н. – М.: Наука,
1979. – 448 с.
3. Коллекция документов Рейтерс [Электронный ресурс]. – Режим доступа:
http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml.
4. Коллекция документов Рейтерс [Электронный ресурс]. – Режим доступа:
http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
5. Новостная коллекция РОМИП [Электронный ресурс]. – Режим доступа:
http://romip.ru/ru/collections/news-collection.html.
6. Емпирические законы Зипфа [Электронный ресурс]. – Режим доступа:
http://artprom.net/article/read/zakon_Zipf.html.
7. Куняев Н.Н. Конфиденциальное делопроизводство и защищенный электронный документообо-
рот / Куняев Н.Н., Демушкин А.С., Фабричнов А.Г. – М.: Логос, 2011. – 452 с.
8. Библиотека LibSVM [Электронный ресурс]. – Режим доступа:
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
9. Литвинов В.В. SVM при классификации мультиязычных текстов / В.В. Литвинов, О.П. Мойсе-
енко // Весник ЧНТУ. – 2013. – № 4. – С 59 – 64.
10. Векторная модель коллекции документов [Электронный ресурс]. – Режим доступа:
http://www.machinelearning.ru/wiki/index.php?title=Векторная_модель.
ISSN 1028-9763. Математичні машини і системи, 2014, № 4 59
11. Нейлор К. Как построить свою экспертную систему / Нейлор К. – М.: Энергоатомиздат, 1991. –
286 с.
12. Боровиков В.П. Программа STATISTICA для студентов и инженеров. – [2-е изд.]. – М.: Компь-
ютерПресс, 2001. – 301 c.
13. Лифшиц Ю. Метод опорных векторов (Слайды) — лекция № 7 из курса «Алгоритмы для Ин-
тернета» [Электронный ресурс]. – Режим доступа: yury.name/internet/07iah.pdf.
14. Крулькевич М.И. Информационная деятельность в организациях / М.И. Крулькевич, Е.М. Сын-
кова. – Донецк: ДонНУ Украины, 2001. – 176 с.
Стаття надійшла до редакції 20.08.2014
|
| id | nasplib_isofts_kiev_ua-123456789-84448 |
| institution | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
| issn | 1028-9763 |
| language | English |
| last_indexed | 2025-12-07T13:22:43Z |
| publishDate | 2014 |
| publisher | Інститут проблем математичних машин і систем НАН України |
| record_format | dspace |
| spelling | Lytvynov, V.V. Moyseenko, O.P. 2015-07-08T13:10:07Z 2015-07-08T13:10:07Z 2014 The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier / V.V. Lytvynov, O.P. Moyseenko // Математичні машини і системи. — 2014. — № 4. — 53-59. — Бібліогр.: 14 назв. — англ. 1028-9763 https://nasplib.isofts.kiev.ua/handle/123456789/84448 004.912: 004.632 Описані механізми навчання та оцінки якості роботи класифікатора в розроблюваній системі автоматизованої обробки великих об'ємів текстової інформації. Класифікатор базується на вільній бібліотеці LibSVM та методі опорних векторів. Система виконує функції пошуку, класифікації, рубрикації та кластеризації текстових документів за запитами користувача. Описываются механизмы обучения и оценки качества работы классификатора в разрабатываемой системе автоматизированной обработки больших объемов текстовой информации. Классификатор базируется на свободной программной библиотеке LibSVM и методе опорных векторов. Система выполняет функции поиска, классификации, рубрикации и кластеризации текстовых документов по запросам пользователя. The mechanisms of teaching and evaluation of the performance of the classifier in the developing system of the automated processing of large volumes of textual information are described. The classifier is based on the free software library LibSVM and support vector machines. The system performs the functions of search, classification, categorization and clustering of text documents at the request of the user en Інститут проблем математичних машин і систем НАН України Математичні машини і системи Інформаційні і телекомунікаційні технології The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier Механізми навчання і оцінки якості роботи класифікатора текстових документів Механизмы обучения и оценки качества работы классификатора текстовых документов Article published earlier |
| spellingShingle | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier Lytvynov, V.V. Moyseenko, O.P. Інформаційні і телекомунікаційні технології |
| title | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| title_alt | Механізми навчання і оцінки якості роботи класифікатора текстових документів Механизмы обучения и оценки качества работы классификатора текстовых документов |
| title_full | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| title_fullStr | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| title_full_unstemmed | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| title_short | The mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| title_sort | mechanisms of teaching and evaluation of the quality of performance of the text documents classifier |
| topic | Інформаційні і телекомунікаційні технології |
| topic_facet | Інформаційні і телекомунікаційні технології |
| url | https://nasplib.isofts.kiev.ua/handle/123456789/84448 |
| work_keys_str_mv | AT lytvynovvv themechanismsofteachingandevaluationofthequalityofperformanceofthetextdocumentsclassifier AT moyseenkoop themechanismsofteachingandevaluationofthequalityofperformanceofthetextdocumentsclassifier AT lytvynovvv mehanízminavčannâíocínkiâkostírobotiklasifíkatoratekstovihdokumentív AT moyseenkoop mehanízminavčannâíocínkiâkostírobotiklasifíkatoratekstovihdokumentív AT lytvynovvv mehanizmyobučeniâiocenkikačestvarabotyklassifikatoratekstovyhdokumentov AT moyseenkoop mehanizmyobučeniâiocenkikačestvarabotyklassifikatoratekstovyhdokumentov AT lytvynovvv mechanismsofteachingandevaluationofthequalityofperformanceofthetextdocumentsclassifier AT moyseenkoop mechanismsofteachingandevaluationofthequalityofperformanceofthetextdocumentsclassifier |