Алгоритм паралельного пошуку для документів, описаних формальною граматикою
In this paper, we developed a concurrent generic heuristic algorithm for parallel parsing and searching in structured text datasets. The main objective of the algorithm was to increase an efficiency of central processing unit dependent operations when parsing large-scale datasets by using a parallel...
Збережено в:
| Дата: | 2018 |
|---|---|
| Автор: | |
| Формат: | Стаття |
| Мова: | Англійська |
| Опубліковано: |
The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
2018
|
| Теми: | |
| Онлайн доступ: | https://journal.iasa.kpi.ua/article/view/139915 |
| Теги: |
Додати тег
Немає тегів, Будьте першим, хто поставить тег для цього запису!
|
| Назва журналу: | System research and information technologies |
| Завантажити файл: | |
Репозитарії
System research and information technologies| _version_ | 1867334341898534912 |
|---|---|
| author | Prodan, Anastasiia O. |
| author_facet | Prodan, Anastasiia O. |
| author_institution_txt_mv | [
{
"author": "Anastasiia O. Prodan",
"institution": "National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv"
}
] |
| author_sort | Prodan, Anastasiia O. |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2019-04-26T15:57:21Z |
| description | In this paper, we developed a concurrent generic heuristic algorithm for parallel parsing and searching in structured text datasets. The main objective of the algorithm was to increase an efficiency of central processing unit dependent operations when parsing large-scale datasets by using a parallel approach. The developed algorithm uses heuristics to find requested data without needing to process the whole file and without syntax tree building. It can be applied to any data formats. An increase in efficiency was discovered when input-output operations take significantly less time than the process of searching, the file is loaded into random access memory or when an efficient non-sequential access to file is possible. We also developed a prototype implementation of the algorithm for use in performance comparisons. The prototype supports searching in large-scale XML datasets using a subset of XPath expressions to specify search request. Our experimental results show that the developed algorithm is faster than classical algorithms, when all the requirements are met and the desired data is located closer to the beginning of the dataset. In worst cases, our algorithm gives nearly the same results as the others, but consumes more memory. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2018.4.05 |
| first_indexed | 2025-07-17T10:24:03Z |
| format | Article |
| fulltext |
Anastasiia Prodan, 2018
58 System Research & Information Technologies, 2018, № 4
УДК 004.021
DOI: 10.20535/SRIT.2308-8893.2018.4.05
A PARALLEL SEARCH ALGORITHM
FOR FORMAL GRAMMAR DATA TYPES
ANASTASIIA PRODAN
Abstract. In this paper, we developed a concurrent generic heuristic algorithm for
parallel parsing and searching in structured text datasets. The main objective of the
algorithm was to increase an efficiency of central processing unit dependent opera-
tions when parsing large-scale datasets by using a parallel approach. The developed
algorithm uses heuristics to find requested data without needing to process the whole
file and without syntax tree building. It can be applied to any data formats. An in-
crease in efficiency was discovered when input-output operations take significantly
less time than the process of searching, the file is loaded into random access memory
or when an efficient non-sequential access to file is possible. We also developed a
prototype implementation of the algorithm for use in performance comparisons. The
prototype supports searching in large-scale XML datasets using a subset of XPath
expressions to specify search request. Our experimental results show that the devel-
oped algorithm is faster than classical algorithms, when all the requirements are met
and the desired data is located closer to the beginning of the dataset. In worst cases,
our algorithm gives nearly the same results as the others, but consumes more memory.
Keywords: grammar, search, parallelism, concurrency, heuristics.
INTRODUCTION
Nowadays there are a lot of digital data representation formats, many of them are
broadly used in almost all fields of human’s interest. Recent achievements of the
information technology have changed the meaning of the information in business
and everyday life. The efficiency of the data storing methods and the speed of
data search have become a valuable advantage.
Most commonly used search methods are divided into three common parts:
1) parsing,
2) decoding,
3) searching.
Parsing refers to lexical and syntax analysis. Input for this stage is raw text
data, and abstract syntax tree is the output. For context-free LL(1) grammars,
lexical analysis can be done using the state machine. Tokenized text is then
processed by one of the forward recursive parsing algorithms.
Decoding refers to semantic analysis. Input for this stage is abstract syntax
tree, and the output depends on file format and decoding engine. For XML-based
data types document object model has to be built. For JSON and other data types
the output of decoding stage is not standardized and depends on the current appli-
cation. Data is represented in graph or tree structure.
Search stage can be started only after all the previous stages are completed.
On the search stage, we traverse the inner structure of the decoded data and return
part of the data, if it matches all the search criteria.
A parallel search algorithm for formal grammar data types
Системні дослідження та інформаційні технології, 2018, № 4 59
To increase speed and efficiency of the search, indexing is used. Indexing al-
lows search algorithm to go directly to the searched data, skipping first two
stages. The problem is that full index is not always available. The process of in-
dexing requires much time, memory and storage space, so it is redundant when
we need to process the file only once, or when we do not have enough storage
space to store full index. Partial index can only speed up some simple queries, but
is useless for complex ones, so search system has to fall back to the first algo-
rithm, that is less efficient.
ALGORITHM
For the cases, where full indexing is not possible or not necessary, we developed
our concurrent heuristic search algorithm. The inputs for this algorithm are search
query and raw text data, and the output is the found data.
Classical way of increasing calculations speed is to run them in parallel.
Using high parallel approach, we can process large files faster than sequentially,
but we should be able to read file non-sequentially. The general algorithm scheme
is shown at fig. 1.
Main steps of the algorithm are listed below:
1. Split file into n fixed-size buffers.
2. Run k parser threads, where nk . Each parser thread process a buffer
sequentially, from the beginning.
3. When thread has completed the processing of the buffer, it consumes next
buffer from the remaining queue.
4. If all the search criteria are satisfied by one of the threads and all the
previous buffers, data is found. Return the data.
5. When the queue is empty, return failure.
The following data structures are required for this algorithm:
1. Currently used buffers.
2. Buffer queue.
3. Results list.
Currently used buffers store raw text data that is being processed by the
parser thread. Size of the buffer is fixed. To calculate optimal size of the buffer,
we have to consider limitations of maximum available memory and minimal
processing unit size in formal grammar representation.
Buffer queue stores pointers to data that is not yet available for reading, and
has to be loaded into one of the buffers yet.
Results list is a simple list that contains data structures of the special type -
result. Every time a worker thread finishes processing of a buffer, it has to put a
result into the results list, so decoded information will be available for all the
other threads. The structure of a result will be described above.
The main process of lexical and syntax analysis is executed in parser threads.
Each parser thread executes the following set of operations:
lexical analysis;
structural analysis;
data search;
result updating.
Anastasiia Prodan
ISSN 1681–6048 System Research & Information Technologies, 2018, № 4 60
Parsing is based on a state machine, that reads input symbols one by one and
changing its state. We use simple pushdown automata for parsing, because it al-
lows passing the context from one buffer to the next one and allows speculative
parsing. Parsing process is shown at fig. 2.
There are two possible ways for a thread to start processing a buffer. If pre-
vious buffer was already processed or if it is the first buffer from the beginning of
the dataset, we can use the information from previous buffer to determine, what
was the previous state and at what state we are beginning.
If we do not know about previous buffer, we still can process the current
buffer. We create multiple state machines, one for every possible state. On each
step every state machine receives the next character as its input data. For each
state machine, if it receives incorrect data and enters error state, it is destroyed.
For simple context-free grammars, this process allows to distinguish only one or
two possible states without knowing about results of parsing previous buffers.
BEGIN
queue;
query
spawn processes
parse
found?
…
queue is
empty?
set result success
set result success
END
+
parse
found? +– –
+–
Fig. 1. General search algorithm structure
A parallel search algorithm for formal grammar data types
Системні дослідження та інформаційні технології, 2018, № 4 61
Search process is executed every time a token is parsed. All the non-terminal
character sequences are being tested against the search query. For the first token,
if it is not a terminal character, not complete match is also allowed, if it matches
the end of a search query. The same applies to the last token, if it matches the be-
ginning of a search query. Match shows that it is possible for this buffer to con-
tain searched data.
After all of the buffer content is processed, and after the result is pushed to
the result list, thread has to check if search was successfully completed. This step
BEGIN
initial state
matches
Non-terminal
sequence end
end of
buffer
match with
search query
report
success
BEGIN
next
character
…
is character
possible?
update stateerror state
– +
– +
– +
– +
Fig. 2. Parsing process
Anastasiia Prodan
ISSN 1681–6048 System Research & Information Technologies, 2018, № 4 62
is only taken if a thread has a full match or an incomplete match at the beginning.
The thread has to wait for all the preceding buffers to be processed. If all the
search criteria are satisfied by the previous buffers, the thread stops all the other
parser threads and returns successful search result. A detailed scheme is
shown at fig. 3.
Result is a data structure, that is able to store all the state machines, that sur-
vived (have never entered the error state) during the parsing operation of a buffer.
Stack of every state machine for storing uncompleted non-terminal character se-
BEGIN
query list;
string
pop next
query
end of string
match with
string
end of query
go to next char
report partial
match
BEGIN
has next
query
report full
match
– +
– +
– +
– +
Fig. 3. Full text search process
A parallel search algorithm for formal grammar data types
Системні дослідження та інформаційні технології, 2018, № 4 63
quences also should be stored inside result data structure, so it can be used when
the next buffer is being processed or after the next buffer has been processed. This
way of sharing the state between partitions, processed in parallel, allows us to
keep the hierarchy.
Also result data structure has to store the results of search query matching.
This information can be represented as a list of discriminated unions (with ele-
ment count equals to search criteria count) with four possible states:
no match;
full match;
partial match (beginning);
partial match (ending).
For partial match cases, part of the string that is matched should also be in-
cluded. For the ending match, we need to keep only index of the match string,
because the string itself is already stored in the stack of the state machine.
IMPLEMENTATION FOR XML
To make a research and get the experimental results, we developed an algorithm
implementation for the XML language. To define a minimal subset of XML,
which can be used for algorithm testing, we need to select element types that are
supported in our XML grammar subset [1]. Supported elements are shown in the
Table 1.
Table 1. Supported elements in XML implementation
N Element type Example
1 Node <node>
2 Text node Text node
3 Attribute attribute=”value”
4 Closing node </node>
At first, we define a finite state machine for the XML grammar, considering
our imitations to keep it simple enough for testing purposes. Then, we define a
search query. In many applications, XPath query language [2] is used to define
the data to be found. To keep the example implementation simple, only one axis
and only one search method of XPath will be used – forward traverse with full
text search. The developed algorithm can only be efficient on forward axis, be-
cause it processes file in forward direction, from the beginning to the end.
In the example (shown in fig. 4) we use only one search query, a text node
with full text matching criteria. The following figure shows the process of search
for a text “Ola Nordmann” inside the sample XML dataset. The input dataset is a
regular XML document with typical hierarchical structure, where data is repre-
sented as text nodes. Text that satisfies the search criteria is located at the begin-
ning of the file, so the algorithm does not need to process the entire file to the end.
After the first processing stage, XML document is split into 15 buffers, con-
taining 30 characters each. It is still just raw XML document, but every buffer is
processed independently in parallel by some worker threads, beginning from the
first buffer.
Anastasiia Prodan
ISSN 1681–6048 System Research & Information Technologies, 2018, № 4 64
After the second execution stage, data is represented as “result” data struc-
tures. Each structure contains the stack for a state machine (upper text field) and
its state, described in lower text field. Also it contains a match flag. Full text
matching with text node started in the third buffer, and completed in the fourth
buffer, so there is no need to continue the search process.
Only four buffers were processed, ant it is enough to find the data. If we use
four threads, because four cores (physical or virtual) is a popular solution for
desktop processors, we can achieve the result of completed search in just one run.
It could be almost four times faster than if it is done sequentially. If we are using
classic non-parallel parsing algorithm, we need to process all the dataset, all 15
buffers, so developed algorithm can be almost 15 times faster in this particular
case in theory.
EXPERIMENTAL RESULTS
For the testing purposes, classical search algorithm from the default .NET platform
XML library will be used to compare efficiency of the algorithms on different
sizes of the dataset.
Result of the experiment can be seen in the Table 2. Graphically experimen-
tal results are shown at the fig. 5. As we can see, the developed parallel search
algorithm is faster on the larger datasets, because it works in parallel and it does
not need to read the entire file to the end, if the searched node is found.
If the dataset is growing linearly, time spend for the search process is growing
linearly as well for both algorithms.
T a b l e 2 . Results of the experiment
N. File size, Mb
Time for the developed
algorithm, ms
Time for the library
algorithm, ms
1 100 137 108
2 500 242 319
3 1000 398 480
4 2000 514 827
5 5000 843 1746
Fig. 4. XML processing
A parallel search algorithm for formal grammar data types
Системні дослідження та інформаційні технології, 2018, № 4 65
We generated a sample XML dataset that includes all of the supported ele-
ment types. Each name or value field length is between 2 and 20 characters. Ele-
ment that is being searched is always present in the dataset. Location of the ele-
ment is randomly generated on each test run.
Classical algorithm can achieve better results on small datasets, because it
uses multiple optimizations and more efficient parsing algorithms, than our sam-
ple implementation. Also, our algorithm requires a complex initialization stage,
with multiple data structures allocation and multiple threads initialization.
For growing datasets, developed algorithm requires less time to find matching
data, but there are several conditions that should be met. Also the speed of the
developed algorithm depends on searchable element’s position. In the worst case
possible for the algorithm, when the searched element is not present in the dataset
and the entire file should be processed, our algorithm still works faster because of
its highly parallel nature, but the difference is less significant.
RELATED WORK
Many methods of parallel text data processing have been presented. Many of
them are applicable only to XML processing, especially for XML parsing. Paral-
lel approach is well known in finite automata based parsing methods [3] specula-
tive parsing methods [4]. Most of parallel approaches can be used for context-
independent LL(1) formal grammars. XML grammar is a subset of LL(1) gram-
mar, but with some unique differences. Parallel XML processing includes parsing,
syntax tree building and XML graph tree building [5]. Modern XML processing
methods use special optimization techniques, applicable only to XML format [6].
Parallel depth-first search is widely used in highly efficient searching sys-
tems [7]. In this paper, we developed a search algorithm, based on parallel search
methods and concurrent formal grammars processing methods for efficient search
in any grammar that differ from the existing methods by using the combination of
parallel parsing and parallel search in one run.
1500
1000
500
1000 2000 3000 4000 5000
2
1
ti
m
e.
m
s
dateset, Mb
Fig. 5. Experimental results: 1 — library; 2 — developed
Anastasiia Prodan
ISSN 1681–6048 System Research & Information Technologies, 2018, № 4 66
CONCLUSION
We presented a concurrent algorithm for search in text documents, represented
using formal context-independent grammars. The developed algorithm was tested
on subset of XML grammar using XPath as query language grammar. Experimen-
tal results show that good speeding up was achieved for large-scale datasets.
Speed up is only possible under some constraints. If the speed of read opera-
tions is limited by the speed of the disk, and parallel reading is not possible, there
is no reason of using highly parallel approach. The developed algorithm will only
slow down the process, because of thread management and context switching.
Also it is recommended limiting thread count to be less or equal to the physical
processing units, to make use of real parallel execution and reduce number of
CPU cache misses. Optimal buffer size depends on multiple factors, from CPU
cache size and random access memory available to the data representation format,
formal grammar and most common text node sizes. To get the best results from
using the developed algorithm, it is recommended to configure these parameters
manually for each application to meet its requirements.
In comparison to the commonly used search methods, this concurrent heuris-
tic search method demonstrates higher efficiency in terms of execution time, but
uses more memory and utilizes more system resources.
To improve the performance characteristics of the algorithm other parsing
methods can be used. Dynamic buffer sizes, used alongside with special data
splitting algorithm, capable of splitting the dataset by terminal characters, would
be great improvement to the developed search algorithm.
REFERENCES
1. Extensible Markup Language (XML) 1.0 (Third Edition). — Available at: http:
//www.w3.org/TR/2004/REC-xml-20040204/. — 2004.
2. Clark J. XML Path Language (XPath) Version 1.0. / J. Clark, S. DeRose. — Avail-
able at: https://www.w3.org/TR/1999/REC-xpath-19991116/. — 1999.
3. Chang J.H. Parallel Parsing on a One-Way Array of Finite-State Machines /
J.H. Chang, O.H. Ibarra, M.A. Palis. — 1987. — P. 64–75.
4. Veillard D. Libxm12 project web page / D. Veillard. — Available at: http://
xmlsoft.org/. — 2004.
5. Chiu K. A compiler-based approach to schema-specific xml parsing / K. Chiu,
W. Lu. — Available at: https://www.researchgate.net/publication/228586122_
A_compiler-based_approach_to_schema-specific_XML_parsing. — 2004.
6. Noga M.L. Lazy xml processin / M.L. Noga, S. Schott, W. Lowe. — 2002. — P. 4–7.
7. Rao V.N. Parallel depth first search. part 1. Implementation / V.N. Rao and
V. Kumar. — 1987. — P. 15–21.
Received 01.08.2018
From the Editorial Board: the article corresponds completely to submitted
manuscript.
|
| id | journaliasakpiua-article-139915 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2025-07-17T10:24:03Z |
| publishDate | 2018 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/5c/34e04601049e343d35cae2087a4a0d5c.pdf |
| spelling | journaliasakpiua-article-1399152019-04-26T15:57:21Z A parallel search algorithm for formal grammar data types Алгоритм параллельного поиска для документов, описанных формальной грамматикой Алгоритм паралельного пошуку для документів, описаних формальною граматикою Prodan, Anastasiia O. grammar search parallelism concurrency heuristics грамматика поиск параллелизм параллельность эвристика граматика пошук паралелізм паралельність евристика In this paper, we developed a concurrent generic heuristic algorithm for parallel parsing and searching in structured text datasets. The main objective of the algorithm was to increase an efficiency of central processing unit dependent operations when parsing large-scale datasets by using a parallel approach. The developed algorithm uses heuristics to find requested data without needing to process the whole file and without syntax tree building. It can be applied to any data formats. An increase in efficiency was discovered when input-output operations take significantly less time than the process of searching, the file is loaded into random access memory or when an efficient non-sequential access to file is possible. We also developed a prototype implementation of the algorithm for use in performance comparisons. The prototype supports searching in large-scale XML datasets using a subset of XPath expressions to specify search request. Our experimental results show that the developed algorithm is faster than classical algorithms, when all the requirements are met and the desired data is located closer to the beginning of the dataset. In worst cases, our algorithm gives nearly the same results as the others, but consumes more memory. Разработан параллельный общий эвристический алгоритм для параллельного анализа и поиска в наборах структурированных текстовых данных. Основная цель алгоритма — повышение эффективности зависимых от центрального процессора операций для анализа крупномасштабных наборов данных с использованием параллельного подхода. Разработанный алгоритм использует эвристику для поиска запрашиваемых данных без необходимости обрабатывать весь файл и без создания синтаксиса. Его можно применять к любым форматам данных. Повышение эффективности обнаруживается, когда операции ввода–вывода занимают значительно меньше времени, чем процесс поиска, а файл загружается в оперативное запоминающее устройство, или когда возможен эффективный непоследовательный доступ к файлу. Разработан также прототип реализации алгоритма для использования при сравнении производительности. Прототип поддерживает поиск в крупномасштабных наборах данных XML с использованием подмножества выражений XPath для указания запроса на поиск. Экспериментальные результаты показывают, что разработанный алгоритм быстрее, чем классические при условии выполнения соответственных требований и расположения данных ближе к началу набора. В худшем случае разработанный алгоритм покажет почти такие результаты, что и другие, но требует больше памяти. Розроблено паралельний загальний евристичний алгоритм для паралельного аналізу і пошуку в наборах структурованих текстових даних. Основна мета алгоритму — підвищення ефективності залежних від центрального процесора операцій для аналізу великомасштабних наборів даних з використанням паралельного підходу. Розроблений алгоритм використовує евристику для пошуку даних за запитом без необхідності обробляти весь файл і без створення синтаксису. Його можна застосовувати до будь-яких форматів даних. Підвищення ефективності виявляється, коли операції введення-виведення займають значно менше часу, ніж процес пошуку, а файл завантажується в оперативний запам’ятовувальний пристрій, або коли можливий ефективний непослідовний доступ до файлу. Розроблено також прототип реалізації алгоритму для застосування у разі порівняння продуктивності. Прототип підтримує пошук у великомасштабних наборах даних XML з використанням підмножини виразів XPath для указання запиту на пошук. Експериментальні результати показують, що розроблений алгоритм швидший за класичні за умови виконання відповідних вимог розташування даних ближче до початку набору. У гіршому випадку алгоритм покаже майже такі результати, що й інші, але потребує більше пам'яті. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2018-12-18 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/139915 10.20535/SRIT.2308-8893.2018.4.05 System research and information technologies; No. 4 (2018); 58-66 Системные исследования и информационные технологии; № 4 (2018); 58-66 Системні дослідження та інформаційні технології; № 4 (2018); 58-66 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/139915/151392 Copyright (c) 2021 System research and information technologies |
| spellingShingle | граматика пошук паралелізм паралельність евристика Prodan, Anastasiia O. Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title | Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title_alt | A parallel search algorithm for formal grammar data types Алгоритм параллельного поиска для документов, описанных формальной грамматикой |
| title_full | Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title_fullStr | Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title_full_unstemmed | Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title_short | Алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| title_sort | алгоритм паралельного пошуку для документів, описаних формальною граматикою |
| topic | граматика пошук паралелізм паралельність евристика |
| topic_facet | grammar search parallelism concurrency heuristics грамматика поиск параллелизм параллельность эвристика граматика пошук паралелізм паралельність евристика |
| url | https://journal.iasa.kpi.ua/article/view/139915 |
| work_keys_str_mv | AT prodananastasiiao aparallelsearchalgorithmforformalgrammardatatypes AT prodananastasiiao algoritmparallelʹnogopoiskadlâdokumentovopisannyhformalʹnojgrammatikoj AT prodananastasiiao algoritmparalelʹnogopošukudlâdokumentívopisanihformalʹnoûgramatikoû AT prodananastasiiao parallelsearchalgorithmforformalgrammardatatypes |