Automated semantic ontology construction for foresight studies using large language models
Recent advances in large language models (LLMs) enable the automated discovery of semantic structures and emerging signals within text streams, offering an opportunity to redesign foresight workflows into continuous, data-driven systems. This study aims to develop and validate an automated framework...
| Date: | 2026 |
|---|---|
| Main Authors: | , , , |
| Format: | Article |
| Language: | English |
| Published: |
The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
2026
|
| Subjects: | |
| Online Access: | https://journal.iasa.kpi.ua/article/view/365265 |
| Journal Title: | System research and information technologies |
| Download file: | |
Institution
System research and information technologies| _version_ | 1869472196139679744 |
|---|---|
| author | Lupenko, Serhii Stoliar, Mykhailo Terentiev, Oleksandr Savastiyanov, Volodymyr |
| author_facet | Lupenko, Serhii Stoliar, Mykhailo Terentiev, Oleksandr Savastiyanov, Volodymyr |
| author_institution_txt_mv | [
{
"author": "Serhii Lupenko",
"institution": "Opole University of Technology, Opole"
},
{
"author": "Mykhailo Stoliar",
"institution": "National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv"
},
{
"author": "Oleksandr Terentiev",
"institution": "National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv"
},
{
"author": "Volodymyr Savastiyanov",
"institution": "National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv"
}
] |
| author_sort | Lupenko, Serhii |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2026-06-30T06:14:59Z |
| description | Recent advances in large language models (LLMs) enable the automated discovery of semantic structures and emerging signals within text streams, offering an opportunity to redesign foresight workflows into continuous, data-driven systems. This study aims to develop and validate an automated framework for extracting, structuring, and comparing semantic ontologies using LLMs. A parallelized approach was used for data mining from social media platforms and for filtering out non-domain data. The key semantic elements, goals and their corresponding hypernyms, were extracted using multiple LLM configurations, with a consensus mechanism to ensure semantic reliability and minimize hallucination. The extracted elements were embedded in a high-dimensional vector space, clustered iteratively using cosine similarity, and merged hierarchically. The convergence process and structural stability were analyzed using the elbow criterion and similarity metrics. The proposed approach provides a cost-efficient alternative to traditional expert-based foresight analysis. By integrating LLM-driven semantic extraction with quantitative clustering, it enables the identification of emerging trends, weak signals, and long-term thematic structures. The results highlight the potential of LLM-based semantic modeling as a foundation for automated foresight systems. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2026.2.09 |
| first_indexed | 2026-07-01T01:00:18Z |
| format | Article |
| fulltext |
S. A. Lupenko, M. V. Stoliar, O. M. Terentiev, V. V. Savastiyanov, 2026
134 ISSN 1681–6048 System Research & Information Technologies, 2026, № 2
METHODS, MODELS, AND TECHNOLOGIES
OF ARTIFICIAL INTELLIGENCE IN SYSTEM
ANALYSIS AND CONTROL
UDC 004.9: 303.732.4
DOI: 10.20535/SRIT.2308-8893.2026.2.09
AUTOMATED SEMANTIC ONTOLOGY
CONSTRUCTION FOR FORESIGHT STUDIES USING
LARGE LANGUAGE MODELS
S.A. LUPENKO, M.V. STOLIAR, O.M. TERENTIEV, V.V. SAVASTIYANOV
Abstract. Recent advances in large language models (LLMs) enable the automated discovery of semantic structures and emerging signals within text streams, offering an opportunity to redesign foresight workflows into continuous, data-driven systems. This study aims to develop and validate an automated framework for extracting, structuring, and comparing semantic ontologies using LLMs. A parallelized approach was used for data mining from social media platforms and for filtering out non-domain data. The key semantic elements, goals and their corresponding hypernyms, were extracted using multiple LLM configurations, with a consensus mechanism to ensure semantic reliability and minimize hallucination. The extracted elements were embedded in a high-dimensional vector space, clustered iteratively using cosine similarity, and merged hierarchically. The convergence process and structural stability were analyzed using the elbow criterion and similarity metrics. The proposed approach provides a cost-efficient alternative to traditional expert-based foresight analysis. By integrating LLM-driven semantic extraction with quantitative clustering, it enables the identification of emerging trends, weak signals, and long-term thematic structures. The results highlight the potential of LLM-based semantic modeling as a foundation for automated foresight systems.
Keywords: foresight, large language models, semantic ontology, scenario analysis,
weak signals, hierarchical clustering.
INTRODUCTION
In recent years, the growing complexity of global events and technological transformations has significantly increased the need for systematic foresight: the process of identifying, analyzing, and interpreting trends and weak signals to anticipate possible futures [1]. Traditionally, foresight relies on expert discussions and panels, scenario workshops, and Delphi studies to capture and structure collective expectations about the future. While such methods provide deep contextual insights, they are slow, costly, and difficult to scale when applied to fast-changing information environments. In other words, by the time an answer is available, the world has already moved on.
At the same time, millions of people are talking, arguing, and planning in real time on platforms such as Telegram, Facebook, X, and others. These public conversations are a rich data stream for anyone trying to spot the next big thing, because new ideas can be seen forming in real time. Such data streams represent an opportunity for automated, data-driven foresight, but extracting meaningful structures from unstructured text that is noisy, multilingual, and full of spam remains a challenge.
The construction of ontologies for decision support is a well-established field in which many researchers work [2–4]. Many previous studies have employed a “knowledge engineering” approach, focusing on manually constructing scenario-based ontologies to conceptualize complex processes [5]. These ontologies are then used to build knowledge graphs that support data integration and simulation, guiding the development of more efficient data provisioning systems [6]. This ontology-driven method is important for improving problem understanding and designing effective optimization workflows.
Recent progress in generative models offers a potential solution to the scalability challenge and makes it possible to work with big data. The emergence of high-capability Large Language Models (LLMs), including GPT-3 [7], GPT-4 [8], Gemini [9], and Grok [10], has significantly advanced AI's capacity for complex reasoning. While much of the world has focused on their role in chatbots or autonomous agents, this paper explores their potential for a different, critical task: automated knowledge discovery. We investigate how the advanced understanding, reasoning, and generative power of LLMs can be leveraged to build the complex knowledge representations, namely ontologies and graphs, that are essential for systematic foresight.
However, the traditional approach faces a significant bottleneck: it depends on subject-oriented, interdisciplinary human expertise. Constructing these ontologies is a laborious, manual process, which makes it difficult to scale or adapt to new, rapidly evolving challenges. We need a fundamental shift from manual knowledge encoding to automated knowledge discovery. This is precisely where our work begins.
This study aims to develop an LLM-driven framework for the automated extraction and hierarchical organization of collective goals from large-scale social data streams. The framework organizes unstructured data into hierarchical components. By using the reasoning capabilities of LLMs together with techniques such as prompt engineering and consensus decision-making, we try to approximate or even replace certain stages of expert analysis in the foresight process. Our goal is to build an approach that can identify, compare, and structure goal-related concepts with cost efficiency, temporal flexibility, and cross-lingual robustness, while maintaining the interpretability required for foresight studies.
To show how the approach works on real-world challenges, we took a large dataset: three years of posts (2022–2025) from the Telegram channel “Victory Drones” [11], a key channel that discusses military technology. It is a popular thematic channel in Ukraine that can also inform the country's future development, and it contains many in-depth discussions of implemented radio technologies. The data were parsed using asynchronous distributed pipelines written in Python and preprocessed to remove advertising and non-relevant posts. Texts were grouped by day and by hour to study both long-term and short-term semantic goal dynamics. First, goal candidates were extracted using multiple LLMs and semantically represented as vectors. Second, the embeddings (numerical vectors that represent text data) were iteratively clustered into hierarchical ontologies based on similarity metrics and gradient methods. These steps were repeated, allowing us to compare ontologies constructed from daily and hourly data using structural similarity measures. Together, they lead to long-term future images, or draft scenarios, as a base component for foresight studies.
METHODOLOGY
The methodology formalizes a four-stage analytical pipeline designed to translate highly unstructured textual data into a dynamic foresight ontology. As illustrated in the workflow diagram in Fig. 1, the system integrates data validation, semantic structuring, and temporal analysis, combining the advanced reasoning capabilities of LLMs with classic automation approaches to automate processes traditionally reserved for human experts.
Fig. 1. Workflow of the four-stage analytical pipeline
DATA
The dataset for this study was mined from the Telegram messenger. The main channel is “Victory Drones” [11]. This source was purposively selected based on several criteria that are critical for foresight analysis of the domain. “Victory Drones” [11] is the most popular channel specializing in military communication technologies, electronic warfare, and unmanned aerial systems. Its high engagement and expert-driven content provide a rich source of emerging terminology, technical discussions, and “weak signals”, that is, indicators of future technological shifts. All of this makes it suitable material for ontology construction and foresight study.
The collection period runs from October 2022 to September 2025, providing a long-range view of context evolution in this domain.
Data mining was performed with Python libraries, with Telethon as the core tool for working with Telegram channels [12]. Asynchronous data parsing and distributed collection scripts were implemented to manage the large data volume and to take API rate limits into account [13]. For each post, we extracted: the full text content (the data for textual analysis), the publication timestamp (the critical metadata for all temporal analysis), the unique post ID (for data integrity and deduplication), and associated metadata (types of attached media such as images and videos, likes, views, number of comments, etc.).
A basic linguistic analysis confirmed the complex multilingual structure of the corpus, with a significant presence of Ukrainian, Russian, and English texts. This shows the international value of the texts, but it poses additional challenges for natural language processing [14].
To guarantee the relevance of the dataset, we applied the following filters:
1. Source Verification: only posts originating directly from the “Victory Drones” [11] channel administrator were retained. All forwarded messages from other channels and user comments were discarded to maintain a consistent source and reduce noise.
2. Content Filtering: non-substantive posts, such as cross-promotional advertising, administrative announcements (for example, channel rules), and simple “thank you” messages, were identified and removed to focus the corpus on high-signal, domain-specific content.
3. Ethical Sourcing: all data were collected from a publicly accessible channel, reducing the need for user authentication and mitigating major privacy concerns. No private user data was accessed.
Finally, the dataset was grouped by time intervals to enable temporal analysis at multiple resolutions and validation of results: daily grouping for long-term trend analysis and ontology construction, and hourly grouping for granular, short-term dynamic analysis. The daily grouping contains approximately 1000 observations. The hourly grouping, which covers the 3rd quarter of 2025, contains approximately the same number of observations.
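The two grouping resolutions can be illustrated with a short Python sketch. It is a minimal illustration, not the collection code used in the study: the `group_posts` helper and the `timestamp` and `text` field names are assumptions introduced here, and the sample posts are toy data.

```python
from collections import defaultdict
from datetime import datetime


def group_posts(posts, resolution="day"):
    """Concatenate post texts into time slices keyed by day or by hour."""
    fmt = {"day": "%Y-%m-%d", "hour": "%Y-%m-%d %H:00"}[resolution]
    slices = defaultdict(list)
    for post in posts:
        slices[post["timestamp"].strftime(fmt)].append(post["text"])
    # One aggregated document per time slice, in chronological key order
    return {key: "\n".join(texts) for key, texts in sorted(slices.items())}


# Toy posts (hypothetical content)
posts = [
    {"id": 1, "timestamp": datetime(2025, 7, 1, 9, 15), "text": "post A"},
    {"id": 2, "timestamp": datetime(2025, 7, 1, 9, 40), "text": "post B"},
    {"id": 3, "timestamp": datetime(2025, 7, 1, 14, 5), "text": "post C"},
]
daily = group_posts(posts, "day")    # one slice for 2025-07-01
hourly = group_posts(posts, "hour")  # two slices: 09:00 and 14:00
```

The same posts therefore yield one coarse document for long-term analysis and several finer documents for short-term analysis.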
APPROACH
The overall process of semantic structure formation is represented as a system $S$ (1) to formalize the construction of a dynamic ontology from unstructured text. This model provides a scalable framework for managing the complex dependencies between text data, extracted meaning, structural relationships, and temporal evolution:
$$S = \langle D, E, R, P, T \rangle. \tag{1}$$
Each element of the system $S$ is explained below.
The component $D$ (2) represents the Data Layer. It is the set of aggregated text documents described in the Data section. Each $d_i$ is a “time slice” of the corpus, grouped either by day or by hour, forming the raw textual input for the system:
$$D = \{d_1, d_2, \dots, d_N\}, \tag{2}$$
where $N$ is the total number of observations.
The component $E$ represents the Semantic Layer. It is the global set of all unique “conceptual atoms” extracted from the corpus. Goal-hypernym pairs are the core of the system $S$. Each hypernym provides a taxonomic classification for the corresponding goal element.
The component $R$ represents the Relational Layer. It is the set of all relations between the semantic elements in $E$. While $E$ is just a flat list of pairs, $R$ models their connections, such as semantic similarity, parent-child hierarchies, and co-occurrence in the timeline.
The component $P$ represents the Procedural Layer. This is the set of all computational procedures and algorithms that transform the data: first, LLM inference for extracting $E$ from $D$; second, vector embedding and clustering algorithms to define $R$; third, graph construction algorithms to build the final ontology; and finally, evaluation metrics to validate its overall structure.
The component $T$ represents the Temporal Layer. This is the set of operations that analyze the evolution of the ontology over time.
Multi-Model Extraction of Goals and Hypernyms
The core step of our methodology is the extraction of the semantic element set $E$ from the data $D$. This is challenging because LLMs can hallucinate (that is, generate plausible but false information) or produce inconsistent outputs. To reduce these risks and ensure high semantic consistency, we developed a multi-model ensemble approach. Each document $d_i$ in $D$ was processed in parallel by a mixed set $M$ (3) of different LLMs:
$$M = \{M_1, M_2, \dots, M_m\}. \tag{3}$$
The models were chosen for their diverse architectures and training data to ensure a range of “opinions”. For each model $M_j$ and document $d_i$, we used a structured prompt engineering technique [15]. The prompt tasked the model to act as a domain expert and extract all conceptual pairs representing a specific technological capability (goal) and its general class (hypernym). The output of this step is a set of candidate pairs for that specific model (4):
$$E_{ij} = M_j(d_i) = \{(g_1, h_1), (g_2, h_2), \dots, (g_n, h_n)\}. \tag{4}$$
At the next step, the results (each $E_{ij}$) were passed to a consensus function $FU$ (5), which retains only those pairs agreed upon by at least two models:
$$e_i = FU\big(M_1(d_i), M_2(d_i), \dots, M_m(d_i)\big) = \{(g_1, h_1), \dots, (g_k, h_k)\}. \tag{5}$$
Multimodal agents have recently demonstrated remarkable foresight capabilities in complex predictive tasks. In [16], En et al. introduce “Merlin”, a vision-language model explicitly trained to develop “foresight minds”.
Finally, the global set $E$ (6) is the set validated by the multi-model approach. Its elements become nodes of the future ontology graph, which is constructed and analyzed over time:
$$E = \bigcup_{i=1}^{K} e_i, \tag{6}$$
where $K$ is the total number of elements.
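The consensus step (5) and the union (6) can be sketched in Python as follows. This is a minimal illustration under one loud assumption: two models are taken to "agree" when their pairs match exactly after lowercasing and trimming, whereas the text does not specify the matching rule. The helper names `normalize`, `consensus`, and `global_set` are hypothetical, and the sample pairs are toy data.

```python
from collections import Counter


def normalize(pair):
    """Assumed matching rule: case- and whitespace-insensitive equality."""
    goal, hypernym = pair
    return (goal.strip().lower(), hypernym.strip().lower())


def consensus(model_outputs, min_votes=2):
    """Keep the pairs proposed by at least `min_votes` distinct models."""
    votes = Counter()
    for pairs in model_outputs:
        # A set guarantees that each model votes at most once per pair
        votes.update({normalize(p) for p in pairs})
    return {pair for pair, n in votes.items() if n >= min_votes}


def global_set(per_document_outputs, min_votes=2):
    """Union of the consensus sets over all documents."""
    validated = set()
    for model_outputs in per_document_outputs:
        validated |= consensus(model_outputs, min_votes)
    return validated


# Toy outputs of three models for one document
m1 = [("jam GPS", "GNSS disruption"), ("extend drone range", "UAV capability")]
m2 = [("Jam GPS ", "gnss disruption")]
m3 = [("extend drone range", "radio technology")]
agreed = consensus([m1, m2, m3])
```

Here only the "jam GPS" pair survives, because the two "extend drone range" candidates disagree on the hypernym. A softer, embedding-based matching rule would be a natural alternative.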
Construction of Semantic Space and Hierarchical Ontology
After creating the global set of validated semantic elements $E = \{e_i \mid i \in [1, K]\}$, the next step is to transform this unstructured set into a hierarchical ontology. This is achieved by embedding the elements within a high-dimensional semantic space, which makes it possible to construct the hierarchical structure of the domain for foresight studies.
Semantic Space Projection
Formally, every $e_i \in E$, $i \in [1, K]$, is mapped into a continuous semantic space using a pre-trained embedding function $f$ (7):
$$v_i = f(e_i), \quad f: E \to \mathbb{R}^n. \tag{7}$$
Here, $v_i$ denotes the $n$-dimensional vector embedding of element $e_i$ in the semantic space. The selection of the embedding function $f$ is important. The model should be a top performer on standardized benchmarks for multilingual text (especially English, Ukrainian, and Russian), such as the Massive Text Embedding Benchmark (MTEB) [17], and preferably open-source. MTEB is a Python framework designed for the systematic evaluation of text embedding models and retrieval systems. We selected the text-embedding-3-large model from OpenAI because of its superior performance in capturing fine-grained semantic relationships across technical and multilingual texts.
The next step is to define a distance metric $d(e_i, e_j)$ (8) between two elements of $E$:
$$d(e_i, e_j) = 1 - \cos(v_i, v_j), \quad d: E \times E \to \mathbb{R}, \tag{8}$$
where $e_i, e_j \in E$ and $\cos(\cdot,\cdot)$ is the cosine similarity.
Cosine distance was selected over Euclidean distance as it is invariant to vec-
tor magnitude and measures only the orientation between vectors. In high-dimen-
sional spaces like text embedding, this is a more reliable measure of semantic sim-
ilarity, where small distances 𝑑 → 0 indicate high similarity.
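The distance (8) is straightforward to compute directly. The sketch below uses only the Python standard library and toy two-dimensional vectors; real embeddings would have thousands of dimensions, and the function name `cosine_distance` is chosen here for illustration.

```python
import math


def cosine_distance(u, v):
    """d = 1 - cos(u, v); depends on vector orientation, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # scaled copy
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # unrelated directions
```

Scaling a vector leaves the distance unchanged, which is the magnitude invariance mentioned above; orthogonal vectors are at distance 1.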
Hybrid Agglomerative Clustering
To build the ontology, we developed a hybrid algorithm that combines algorithmic clustering and gradient optimization approaches with the conceptual understanding of an LLM. The process is iterative and builds the hierarchy from the bottom up.
Phase one. Vector-Based Agglomeration
At any iteration $t$, let the set $E$ be partitioned into $k^{(t)}$ disjoint clusters $C^{(t)} = \{C_1^{(t)}, C_2^{(t)}, \dots, C_{k^{(t)}}^{(t)}\}$, where $\bigcup_i C_i^{(t)} = E$ and $C_i^{(t)} \cap C_j^{(t)} = \varnothing$ $(i \neq j)$.
At the first iteration, every element forms its own cluster: $C^{(1)} = \{\{e_j\},\ j \in [1, K]\}$.
The basic idea of merging is to use the shortest distance between any point in one cluster and any point in the other. This is the classic single-linkage rule (9):
$$d(C_i^{(t)}, C_j^{(t)}) = \min_{e_p \in C_i^{(t)},\ e_q \in C_j^{(t)}} d(e_p, e_q). \tag{9}$$
The pair of clusters with the smallest linkage distance is selected for merging (10), so clusters merge only if they are internally close and well separated from the remaining clusters. This stops loose or accidental links from forming early, so the chains stay clean and meaningful:
$$(C_{i^*}, C_{j^*}) = \arg\min_{i \neq j} d(C_i^{(t)}, C_j^{(t)}). \tag{10}$$
The two (or more) clusters are then merged to form a new cluster for the next iteration (11):
$$C_{\text{new}}^{(t+1)} = C_{i^*}^{(t)} \cup C_{j^*}^{(t)}. \tag{11}$$
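A naive Python sketch of this merge loop is given below, assuming the single-linkage rule described in the text and cosine distance between embeddings. It is an illustration rather than the authors' implementation: the function names and the toy two-dimensional vectors are hypothetical, and the brute-force pair search is only practical for small inputs.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def single_linkage(cluster_a, cluster_b, vectors):
    """Shortest distance between any member of one cluster and any of the other."""
    return min(cosine_distance(vectors[p], vectors[q])
               for p in cluster_a for q in cluster_b)


def merge_step(clusters, vectors):
    """One iteration: find the closest pair of clusters and merge it."""
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]], vectors))
    merged = clusters[i] | clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


def agglomerate(vectors, k_target):
    """Start from singleton clusters and merge until k_target clusters remain."""
    clusters = [frozenset([name]) for name in vectors]
    while len(clusters) > k_target:
        clusters = merge_step(clusters, vectors)
    return clusters


# Toy 2-D "embeddings" (hypothetical values)
vectors = {
    "jam GPS": [1.0, 0.1],
    "spoof Galileo": [0.9, 0.2],
    "extend drone range": [0.1, 1.0],
    "increase flight time": [0.2, 0.9],
}
base_clusters = agglomerate(vectors, k_target=2)
```

On this toy input the two navigation-related goals end up in one cluster and the two range-related goals in the other.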
The last question in this phase is how to determine the optimal number of base clusters. We apply the “elbow criterion” to the within-cluster residual function $J(k)$ (12) to determine the optimal number of base clusters $k^*$. The residual is the sum of the squared distances to the cluster centroids:
$$J(k) = \sum_{i=1}^{k} \sum_{v \in C_i} \lVert v - m_i \rVert^2, \tag{12}$$
where $m_i$ is the centroid (13) of cluster $C_i$:
$$m_i = \frac{1}{|C_i|} \sum_{v \in C_i} v. \tag{13}$$
As $k$ increases, $J(k)$ decreases. The elbow point $k^*$ is detected from the discrete first differences (gradients) (14):
$$\Delta J(k) = J(k) - J(k - 1). \tag{14}$$
The “elbow” $k^*$ is identified as the point where the rate of change of the residual stabilizes, $\frac{\Delta J(k^*)}{\Delta J(k^* - 1)} \to 1$, indicating that further merges would combine conceptually distinct groups. This leads to an optimal base partition.
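The elbow detection can be sketched in Python as shown below. The sketch assumes that the stabilization condition is read as the ratio of consecutive first differences, $\Delta J(k) / \Delta J(k-1)$, approaching 1, and it introduces a tolerance `tol` that is not part of the text. The residual values in the example are toy numbers, not results from the study.

```python
def centroid(points):
    """Component-wise mean of a list of vectors (13)."""
    return [sum(component) / len(points) for component in zip(*points)]


def residual(clusters):
    """J(k): sum of squared Euclidean distances to the cluster centroids (12)."""
    total = 0.0
    for points in clusters:
        m = centroid(points)
        total += sum(sum((a - b) ** 2 for a, b in zip(v, m)) for v in points)
    return total


def first_differences(J):
    """Delta J(k) = J(k) - J(k - 1) for a dict {k: J(k)} (14)."""
    return {k: J[k] - J[k - 1] for k in sorted(J) if k - 1 in J}


def elbow(J, tol=0.3):
    """Smallest k whose ratio Delta J(k) / Delta J(k - 1) is within tol of 1."""
    dJ = first_differences(J)
    for k in sorted(dJ):
        if k - 1 in dJ and dJ[k - 1] != 0 and abs(dJ[k] / dJ[k - 1] - 1) <= tol:
            return k
    return None  # no stabilization point found


# Toy residual curve (hypothetical values)
J = {1: 100.0, 2: 40.0, 3: 12.0, 4: 10.0, 5: 8.5, 6: 7.2}
k_star = elbow(J)
```

With this curve the differences are -60, -28, -2, -1.5, and -1.3, so the ratio first comes within the tolerance at $k = 5$ (ratio 0.75). A tighter or looser `tol` moves the detected point, which is why the threshold should be treated as a tunable assumption.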
Phase two. LLM-in-the-Loop Semantic Labeling and Merging
Once the $k^*$ base clusters are identified, the algorithm shifts from purely vector-based merging to a more abstract, concept-based merging using an LLM.
Each base cluster $C_i \in C^{(t)}$ is “semantically labeled”. We use prompt engineering with an LLM to generate an abstract hyper-concept (hypernym) $L_i$ (15) that best describes all elements in the cluster. The prompt includes a representative sample of terms from the cluster, for example the 5–10 elements closest to the centroid $m_i$ (13):
$$L_i^{(t)} = \mathrm{LLM}\big(\mathrm{terms}(C_i^{(t)})\big). \tag{15}$$
For example, a cluster containing “jam GPS”, “spoof Galileo”, and “disrupt GLONASS” might receive the label $L_i$ = “GNSS Disruption Techniques”.
The algorithm then proceeds to merge these $k^*$ labeled clusters. Instead of applying $d$ to all vectors from $E$, we merge based on the semantic similarity of the LLM-generated labels. At each new iteration, the algorithm merges the two clusters $C_i$ and $C_j$ whose labels $L_i$ and $L_j$ have the highest similarity, measured as in phase one. The newly formed cluster $C_{\text{new}} = C_i \cup C_j$ is then re-labeled by the LLM.
An important practical question is how to work with the graph structure. NetworkX is a Python library for creating and analyzing graph structures, chosen for its efficiency in handling topological data and pathfinding queries, which helps establish the fundamental connectivity of roads. In the NetworkX environment, the graph components (nodes as cities, edges as routes) serve as the building blocks for decision optimization. To maximize network traversal performance, the graph schema stores only essential topological data and pre-calculated attributes (e.g., node/edge IDs, mode, distance, slope). These attributes are critical for applying the necessary constraints during traversal, thereby guaranteeing query speed and relevance for decision support.
Phase three. Convergence and Ontology Stabilization
This hierarchical aggregation process (phase two) proceeds iteratively until a stopping condition is met. The criteria are: the difference in total distance (12) falls below a threshold, the cluster quality stabilizes, or the minimum cluster count is reached.
The primary criterion is based on the residual sum of squared distances to the cluster centroids. Merging stops when the newly computed $J(k^*)$ no longer differs from $J(k^* - 1)$: $|J(k^*) - J(k^* - 1)| < \tau$, where $\tau$ is a small value.
The second criterion is the stabilization of cluster quality: the mean Silhouette score $S$ (16) for the partition $C^{(t)}$ reaches a local maximum (changes by less than a small threshold $\varepsilon$):
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \tag{16}$$
where $a(i)$ is the average distance between element $i$ and all other points in its own cluster, and $b(i)$ is the average distance between element $i$ and the nearest neighbouring cluster centroid.
The last criterion is the minimum cluster count, a predefined minimum number of top-level categories (it can be set by domain experts).
Empirically, we observed that semantic stabilization starts after approximately five to seven iterations, with $\tau \approx 0.01$ and $\varepsilon \approx 0.01$. In the end, the domains in the corpus are identified and organized. The final result is a tree-like hierarchical structure that represents the final ontology.
Trend analysis
The final objective of our system is not to construct a static ontology, but to under-
stand its evolution over time. This analysis is formalized through the
Temporal Layer T of our system S. We analyze the frequency of the semantic ele-
ments E across different temporal aggregations to identify emerging, stabilizing,
and disappearing trends.
To measure the frequency of an individual semantic element $e_i \in E$ in a specific time interval, we employ a TF-IDF approach [18]. Importantly, to reduce noise in the data, we use only elements from the first layer of the constructed ontology. The TF-IDF approach is adapted to our temporal framework, where the “document” is defined as a time-aggregated corpus.
The corpus $D$ is temporally partitioned into a sequence of time slices $\{D^{(t)} \mid t = 1, \dots, T\}$, where $t$ is the index of the temporal interval (e.g., month or day).
Term Frequency (TF) (17) is the frequency of an element $e_i$ within a specific time slice $D^{(t)}$:
$$TF(e_i, D^{(t)}) = \frac{n(e_i, D^{(t)})}{\sum_{e \in E} n(e, D^{(t)})}, \tag{17}$$
where $n(e, D^{(t)})$ is the number of occurrences of element $e$ in $D^{(t)}$.
Inverse Document Frequency (IDF) (18) is based on the document frequency, that is, the number of time slices containing the element $e_i$:
$$IDF(e_i, \{D^{(t)}\}) = \log \frac{T}{|\{t : e_i \in D^{(t)}\}|}. \tag{18}$$
The TF-IDF score $\rho(e_i, t)$ (19) for element $e_i$ at time $t$ is their product. The $\rho(e_i, t)$ scores allow us to quantify which goals and hypernyms were most distinctive during a given period, rather than just most frequent:
$$\rho(e_i, t) = TF(e_i, D^{(t)}) \times IDF(e_i, \{D^{(t)}\}). \tag{19}$$
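A compact Python sketch of the temporal TF-IDF score (19) is shown below. It assumes the standard forms of the two factors, relative frequency within a slice for TF and the logarithm of the number of slices divided by the number of slices containing the element for IDF; the function name `temporal_tfidf` and the element labels are hypothetical toy data.

```python
import math
from collections import Counter


def temporal_tfidf(slices):
    """slices: one list of element labels per time slice.

    Returns one {element: score} dict per slice.
    """
    n_slices = len(slices)
    slice_frequency = Counter()
    for elements in slices:
        slice_frequency.update(set(elements))  # count each slice once per element
    scores = []
    for elements in slices:
        counts = Counter(elements)
        total = sum(counts.values())
        scores.append({
            e: (n / total) * math.log(n_slices / slice_frequency[e])
            for e, n in counts.items()
        })
    return scores


# Toy first-layer elements observed in three consecutive time slices
slices = [
    ["electronic warfare", "electronic warfare", "FPV drones"],
    ["electronic warfare", "fiber-optic control"],
    ["electronic warfare", "FPV drones"],
]
scores = temporal_tfidf(slices)
```

An element present in every slice, such as "electronic warfare" here, receives a score of zero, while "fiber-optic control", which appears in a single slice, is the most distinctive element of the second slice. This is the "distinctive rather than frequent" behaviour described above.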
To capture both macro-level shifts and micro-level volatility, we apply two distinct temporal aggregation strategies based on the desired analysis scope.
Long-Term Trend Analysis: to analyze the evolution of the overall ontology across the entire multi-year corpus, the data are aggregated by month. This macro-level view smooths out short-term noise, providing a clear picture of how high-level goals and technologies (hypernyms) emerge and stabilize over quarters and years.
Short-Term Dynamic Analysis: to investigate localized tendencies and immediate responses, the data are aggregated by day. This finer-grained resolution allows us to detect rapid shifts in discussion focus that correspond to the initial emergence of new trends.
Together, these twin knowledge structures provide dual temporal information for foresight analysis, contrasting the stability of long-term intentions with the volatility of short-term discursive dynamics.
RESULTS
Moving from theory to practice, this section presents a case study that demonstrates the application and practical utility of the proposed LLM-driven methodology on a real-world decision challenge. Based on the methodology described above, we present the results of the semantic extraction, hierarchical ontology construction, and comparative analysis across temporal resolutions. Our goal is to build an approach that identifies, structures, and compares goal-related concepts with the cost efficiency, cross-lingual robustness, and temporal flexibility required for foresight studies.
We begin by describing the characteristics of the extracted semantic elements, including the base prompt (translated into English from Ukrainian) for extracting goals with their corresponding hypernyms and the number of semantic elements at different time intervals. Next, we analyze the hierarchical structure of the resulting ontologies, identifying key hypernyms and dominant goal classes that emerged over the studied period (2022–2025). Finally, we perform a comparative analysis between daily and hourly ontologies, assessing structural stability.
Goals and Hypernyms extraction. This stage implements the multi-model ensemble procedure described earlier, designed to extract reliable goal-hypernym pairs $E$ from the raw text documents $D$. Each document $d_i \in D$ (representing a daily or hourly time slice) was processed by a set of five state-of-the-art Large Language Models: GPT-3.5, GPT-4, Gemini, Grok, and DeepSeek. This diversity in model architecture and training data was chosen to minimize any single model's biases or hallucination effects.
The specific instruction was provided to each model $M_j$ through a structured Base Prompt (translated into English):
I will provide you with news on the topic of the Russian-Ukrainian war. All posts are related to the topic of the Russian-Ukrainian war. Your task is to conduct an analytical analysis and submit the result exclusively in JSON format.
Required:
1. Identify **goals** that are mentioned in the texts.
- Consider short-term, long-term, tactical and strategic goals.
- For each goal, highlight the key technologies/means that were used or are planned to achieve.
2. For each goal, determine its **hypernym** (a more general concept). Also provide a **hypernym for this hypernym** (i.e. the second level of generalization).
3. Identify **results** that are mentioned in the texts.
- Results can also be short-term, long-term, tactical or strategic.
- For each result, indicate the key technologies/tools that were used to achieve it.
4. For each result, also provide its **hypernym** and **hypernym to hypernym**.
### Response format (JSON):
{{
"goals": [
{{ "text": "liberation of a specific settlement", "type": "tactical / strategic / short-term / long-term", "technologies": ["kamikaze drones", "artillery"], "hypernym": "military operation", "hypernym_of_hypernym": "military activity" }},
...
],
"results": [
{{ "text": "destruction of ammunition depot", "type": "tactical result", "technologies": ["missile strike", "UAV"], "hypernym": "strike on military infrastructure", "hypernym_of_hypernym": "military activity" }},
...
]
}}
### Important requirements:
- Answer only in UKRAINIAN.
- Do not invent data, but rely only on the posts provided.
- If information is missing – leave an empty list or null.
- Format the response only as valid JSON without additional comments.
Here is the message text.
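A reply in the requested JSON format can be reduced to goal-hypernym pairs with a few lines of Python. The sketch below is a hypothetical helper, not the authors' parser: the name `extract_pairs` and the decision to treat malformed replies as empty are assumptions, and the sample reply reuses the example values from the prompt.

```python
import json


def extract_pairs(raw_response):
    """Parse one model reply and return its (goal, hypernym) pairs."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # assumption: a malformed reply contributes no candidates
    if not isinstance(data, dict):
        return []
    pairs = []
    for goal in data.get("goals") or []:  # tolerate a missing or null list
        if not isinstance(goal, dict):
            continue
        text, hypernym = goal.get("text"), goal.get("hypernym")
        if text and hypernym:
            pairs.append((text, hypernym))
    return pairs


reply = """
{"goals": [{"text": "liberation of a specific settlement",
            "type": "tactical",
            "technologies": ["kamikaze drones", "artillery"],
            "hypernym": "military operation",
            "hypernym_of_hypernym": "military activity"}],
 "results": []}
"""
pairs = extract_pairs(reply)
```

Defensive parsing matters here because the prompt allows empty lists and null values, and because a model may still return text that is not valid JSON.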
To determine the relevance of the extracted semantic elements, we applied the consensus filtering function $FU$. A candidate pair was confirmed as a validated element $e \in E$ only if it was independently identified by at least two distinct LLMs. This threshold significantly reduced semantic noise and improved the confidence that the extracted elements genuinely represent the collective intent present in the source discourse. In future work, we will investigate adaptive thresholding mechanisms based on semantic similarity.
The effectiveness of this multi-model (multi-agent) extraction process is characterized by the resulting number of distinct semantic elements (goals and topics) identified per text. This distribution is the key to understanding the coverage and temporal density of the corpus.
The distribution of the count of topics per text for the daily grouping is presented in Table 1.
Table 1. Distribution of the number of semantic elements per document for the daily grouping (long time period). Each semantic element is a (goal, hypernym) pair
| Count of semantic elements | Count of documents |
|---|---|
| 1 | 86 |
| 2 | 183 |
| 3 | 193 |
| 4 | 85 |
| 5 | 26 |
| 6 | 5 |
| 8 | 2 |
S. A. Lupenko, M. V. Stoliar, O. M. Terentiev, V. V. Savastiyanov
ISSN 1681–6048 System Research & Information Technologies, 2026, № 2 144
The distribution of the count of topics per text for the hourly grouping is
shown in Table 2.
Table 2. Distribution of the number of semantic elements per document for the
hourly grouping (short time period). Each semantic element is a (goal, hypernym) pair
| Count of semantic elements | Count of documents |
|---|---|
| 2 | 40 |
| 3 | 303 |
| 4 | 483 |
| 5 | 159 |
| 6 | 23 |
| 7 | 12 |
| 9 | 1 |
| 10 | 1 |
| 13 | 1 |
Tables 1 and 2 show the volume of validated semantic information available
for long-term trend analysis (daily) versus local dynamic analysis (hourly).
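The two distributions can be summarized numerically. The sketch below takes the counts exactly as listed in Tables 1 and 2 and derives the number of documents, the number of validated semantic elements, and the mean number of elements per document; the function name `summarize` is an assumption of this example, and the derived totals are simple arithmetic on the tables rather than figures reported by the authors.

```python
# {count of semantic elements: count of documents}, copied from Tables 1 and 2.
daily = {1: 86, 2: 183, 3: 193, 4: 85, 5: 26, 6: 5, 8: 2}
hourly = {2: 40, 3: 303, 4: 483, 5: 159, 6: 23, 7: 12, 9: 1, 10: 1, 13: 1}


def summarize(dist: dict[int, int]) -> tuple[int, int, float]:
    """Return (documents, semantic elements, mean elements per document)."""
    docs = sum(dist.values())
    elements = sum(k * n for k, n in dist.items())
    return docs, elements, elements / docs


for name, dist in (("daily", daily), ("hourly", hourly)):
    docs, elements, mean = summarize(dist)
    print(f"{name}: {docs} documents, {elements} elements, {mean:.2f} per document")
# daily: 580 documents, 1547 elements, 2.67 per document
# hourly: 1023 documents, 3970 elements, 3.88 per document
```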
The next step is to present the results of the hierarchical ontology construction.
First, the validated goal-hypernym pairs were mapped into a high-dimensional
vector space using OpenAI’s text-embedding-3-large model; the reasons for
choosing this embedding model were described earlier. Second, we used GPT-4
from the same LLM provider (OpenAI) to generate the abstract hyper-concepts for
the higher levels of the ontology hierarchy. The decision to use embedding and
reasoning models from the same provider was intended to minimize potential
semantic shift or misalignment, so that the vector space used for clustering
is highly congruent with the contextual understanding employed by the model
generating the conceptual labels.
The vectors representing the extracted goals were clustered using the
hierarchical agglomerative approach described in the theoretical part above.
The first iteration of the resulting ontology structure is visualized in Fig. 2.
Fig. 2. Ontology structure obtained using the hierarchical agglomerative approach
Automated semantic ontology construction for foresight studies using large language models
Системні дослідження та інформаційні технології, 2026, № 2 145
To determine the optimal boundary for the initial cluster separation, we
analyzed the change in inter-cluster distance (the “gradient” or first difference)
across the hierarchy. This analysis identified a critical point, or the “best cut,”
at a semantic distance threshold of 0.35, which corresponds to 156 distinct
clusters. The dependence between the “total distance” of elements and the cluster
division threshold is illustrated in Fig. 3.
Fig. 3. Distance dependency. Solid line – second derivative of normalized total distance
(left axis OY). Dashed line – normalized total distance (right axis OY). OX axis is a border
value of semantic similarity
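One way to locate such a "best cut" is to look for the sharpest bend in the normalized total-distance curve, as the second-derivative line in Fig. 3 suggests. The sketch below is a hedged illustration, not the authors' procedure: the function name `best_cut` is an assumption, the thresholds are assumed to be evenly spaced, and the input values are synthetic numbers chosen so that the bend falls at 0.35; they are not data from the study.

```python
def best_cut(thresholds: list[float], total_distance: list[float]):
    """Return (threshold at the sharpest bend, list of second differences)."""
    lo, hi = min(total_distance), max(total_distance)
    norm = [(d - lo) / (hi - lo) for d in total_distance]  # scale to [0, 1]
    # Central second difference at every interior point (uniform spacing assumed).
    second = [norm[i - 1] - 2 * norm[i] + norm[i + 1]
              for i in range(1, len(norm) - 1)]
    i_best = max(range(len(second)), key=lambda i: abs(second[i]))
    return thresholds[i_best + 1], second  # +1 because second[] skips the first point


# Synthetic curve: slow growth up to 0.35, fast growth afterwards.
thresholds = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
total_distance = [10.0, 12.0, 14.0, 16.0, 30.0, 44.0, 58.0]
cut, curvature = best_cut(thresholds, total_distance)
print(cut)  # 0.35
```

On real data the second difference is noisy, so some smoothing of the curve before differencing is usually advisable.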
The next step was to generate high-level semantic descriptors for each cluster.
For each of the 156 clusters 𝐶, a representative subset was selected: up to ten
semantic elements that had the smallest cosine distance to the cluster's
centroid 𝑚. These representative elements served as the input for GPT-4,
which was tasked with generating the cluster label 𝐿 (see formula 15). The prompt
instruction for the model was:
“You need to provide a hypernym for the list of terms. Let me remind you that
a hypernym is a word (or phrase) with a broader, generalized meaning, denoting
a generic concept, class, or set of objects. Please provide the answer without
comments, just the hypernym. List of terms:”
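Selecting the representatives and assembling the prompt can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `representatives` and `build_prompt`, the two-dimensional toy vectors, and the comma-separated formatting of the term list are hypothetical; the centroid is taken as the arithmetic mean of the cluster's vectors, and only the prompt wording comes from the text.

```python
import math

PROMPT = ("You need to provide a hypernym for the list of terms. Let me remind you "
          "that a hypernym is a word (or phrase) with a broader, generalized meaning, "
          "denoting a generic concept, class, or set of objects. Please provide the "
          "answer without comments, just the hypernym. List of terms:")


def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representatives(items: list[tuple[str, list[float]]], k: int = 10) -> list[str]:
    """Return up to k texts whose vectors are closest to the cluster centroid."""
    dim = len(items[0][1])
    centroid = [sum(vec[j] for _, vec in items) / len(items) for j in range(dim)]
    ranked = sorted(items, key=lambda item: cosine_distance(item[1], centroid))
    return [text for text, _ in ranked[:k]]


def build_prompt(terms: list[str]) -> str:
    return PROMPT + " " + ", ".join(terms)  # list formatting is an assumption


# Toy cluster with made-up 2-D vectors (real embeddings have thousands of dimensions).
cluster = [("tactical security", [1.0, 0.1]),
           ("national security", [0.9, 0.2]),
           ("cybersecurity", [0.2, 1.0])]
top = representatives(cluster, k=2)
print(build_prompt(top))
```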
The resulting labels (representing abstract hyper-concepts) were used in the next
iterations of the algorithm for higher-level merging, which is based on the semantic
similarity between the labels themselves. Table 3 presents illustrative examples
from this stage. Notably, the table is in the original multilingual format of the
dataset to underscore the framework’s cross-lingual robustness and to showcase the
raw inputs processed by the LLM. The header of each column displays the
LLM-generated hyper-concept, based on the 10 nearest elements in the
corresponding cluster.
Table 3. Example of hyper-concepts
| **security** | **military activity** | **logistics** | **education** | **technology** |
|---|---|---|---|---|
| security | support of military actions | logistics operations | education and training | Technology development |
| tactical security | military interaction | logistics project | educational programs | technical development |
| operational security | Military operations | military logistics | educational system | development of new technologies |
| territorial security | military observation | logistics system | educational process | technology development |
| National security | Military campaign | Logistics security | educational initiative | Technology implementation |
| protection of national security | supply of military means | logistics | education | technology development |
| Security provision | restructuring of the military fleet | innovations in logistics | educational activity | technology testing |
| security systems | military cooperation | Weapons logistics | education and science | scientific and technological progress |
| security enhancement | military security | logistics optimization | educational project | technological development |
| cybersecurity | Military communication | logistics support | educational infrastructure | technology development |
The header of each column (set in bold) displays the LLM-generated hyper-concept,
based on the 10 nearest elements in the corresponding cluster.
The iterative building process was monitored using two key metrics to determine
the optimal stopping point. The first is the average semantic (similarity) distance
between the new and the old cluster names. The second is the Silhouette score.
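Both monitoring metrics can be written down compactly. The sketch below is an illustration under assumptions, not the authors' code: it uses cosine distance throughout, pairs each old cluster-name vector with the new name vector at the same position (how old and new names are paired in the study is not stated), assumes at least two clusters, and assigns a silhouette of 0 to single-element clusters, which is a common convention.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_label_distance(old_vecs, new_vecs):
    """Average cosine distance between paired old and new cluster-name vectors."""
    return sum(cosine_distance(o, n) for o, n in zip(old_vecs, new_vecs)) / len(old_vecs)


def silhouette(vectors, labels):
    """Mean silhouette coefficient with cosine distance (needs >= 2 clusters)."""
    scores = []
    for i, (v, own) in enumerate(zip(vectors, labels)):
        dists = {}  # cluster label -> distances from point i to that cluster's points
        for j, (w, lab) in enumerate(zip(vectors, labels)):
            if j != i:
                dists.setdefault(lab, []).append(cosine_distance(v, w))
        if own not in dists:  # point is alone in its cluster
            scores.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])  # mean intra-cluster distance
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of 2-D vectors.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
labels = [0, 0, 1, 1]
print(round(silhouette(vectors, labels), 3))
```

A falling average distance between successive label sets indicates that renaming has stabilized, while the silhouette score tracks how well separated the merged clusters remain.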
For the ontology derived from daily grouping, the iterative convergence
process reached stability after five iterations. To investigate short-term semantic
dynamics and reveal local fluctuations (often interpreted in foresight studies as
weak signals), the entire pipeline was replicated using the dataset aggregated
at the hourly level. Crucially, the convergence process for the hourly-derived
ontology also achieved stability after five iterations.
The final structural metrics for both ontologies are summarized in Table 4.
Table 4. Convergence metrics
| Iteration | Semantic distance, day group | Semantic distance, hour group | Silhouette score, day group | Silhouette score, hour group |
|---|---|---|---|---|
| 1 | 0.412 | 0.379 | 0.546 | 0.516 |
| 2 | 0.298 | 0.266 | 0.457 | 0.403 |
| 3 | 0.201 | 0.176 | 0.35 | 0.373 |
| 4 | 0.163 | 0.153 | 0.25 | 0.32 |
| 5 | 0.155 | 0.141 | 0.24 | 0.31 |
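The stopping behavior can be made concrete with a simple rule applied to the semantic-distance columns of Table 4: stop at the first iteration whose change from the previous one falls below a tolerance. The tolerance of 0.015 and the function name `first_stable_iteration` are assumptions of this sketch; the text does not state the exact criterion the authors used.

```python
# Semantic distance per iteration, copied from Table 4.
semantic_distance = {
    "day": [0.412, 0.298, 0.201, 0.163, 0.155],
    "hour": [0.379, 0.266, 0.176, 0.153, 0.141],
}


def first_stable_iteration(values: list[float], tol: float = 0.015):
    """Return the 1-based iteration at which the change first drops below tol."""
    for i in range(1, len(values)):
        if abs(values[i - 1] - values[i]) < tol:
            return i + 1
    return None  # no iteration met the tolerance


for group, values in semantic_distance.items():
    print(group, first_stable_iteration(values))
# day 5
# hour 5
```

With this assumed tolerance both groupings stop at iteration 5, consistent with the reported convergence; a stricter tolerance would require further iterations that the table does not cover.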
The global structure of both the daily and hourly ontologies is a large and
complex knowledge graph. We therefore focus on a specific thematic area to
illustrate the key findings of our temporal comparison.
Fig. 4 presents the final converged sub-graph for “Military Actions” derived
from the daily grouping.
Fig. 4. “Military Actions” for day grouping
In contrast, Fig. 5 displays the equivalent “Military Actions” sub-graph derived
from the hourly grouping.
Fig. 5. “Military Actions” for hourly grouping
To validate the robustness of the proposed approach, a comparative analysis
was conducted between the long-term (daily) and short-term (hourly) ontologies.
This comparison serves as a semi-validation mechanism, allowing us to
assess whether both temporal models capture a consistent semantic representation
of the domain. Across both ontologies, a total of 886 unique thematic concepts
were identified. Among these, 262 concepts were common to both structures,
representing the core semantic intersection. The daily (long-term)
ontology contained 405 unique topics not present in the short-term model, while
the hourly (short-term) ontology introduced 219 distinctive topics absent from
the long-term perspective.
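The comparison amounts to set operations on the two concept inventories. The sketch below first checks the reported counts for internal consistency and derives the overlap ratio (Jaccard index), then shows the same operations on toy sets; the function name `compare` and the toy topic names are assumptions for illustration, and the Jaccard value is derived arithmetic rather than a figure stated by the authors.

```python
def compare(a: set, b: set) -> dict:
    """Shared and unique members of two concept sets, plus their Jaccard index."""
    union = a | b
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }


# Counts reported in the text.
shared, only_daily, only_hourly = 262, 405, 219
total = shared + only_daily + only_hourly
print(total)                     # 886
print(round(shared / total, 3))  # 0.296, share of concepts found in both ontologies
print(shared + only_daily, shared + only_hourly)  # 667 481, size of each ontology

# Toy example with hypothetical topic names.
daily_topics = {"logistics", "education", "defense"}
hourly_topics = {"logistics", "air alert"}
result = compare(daily_topics, hourly_topics)
print(result["shared"], result["jaccard"])
```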
The last part presents the results of the temporal analysis, in which the
frequency of the established semantic elements E is tracked over time using the
adapted TF-IDF score. To keep the visualizations readable, we show only the top
topics for both analyses: long-term and short-term.
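The idea of scoring topic prominence per time window can be illustrated with textbook TF-IDF, treating each time window as a "document" and each topic mention as a "term". This is a sketch of the standard formula only: the adapted score used in the study is defined in the theoretical part and may differ, and the function name `tfidf_by_window`, the window labels, and the toy mentions are assumptions.

```python
import math
from collections import Counter


def tfidf_by_window(mentions: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """mentions: {window: [topic, ...]} -> {window: {topic: TF-IDF score}}."""
    n_windows = len(mentions)
    df = Counter()  # number of windows in which each topic appears
    for topics in mentions.values():
        df.update(set(topics))
    scores = {}
    for window, topics in mentions.items():
        counts = Counter(topics)
        total = len(topics)
        scores[window] = {
            topic: (count / total) * math.log(n_windows / df[topic])
            for topic, count in counts.items()
        }
    return scores


# Toy monthly mentions (hypothetical).
mentions = {
    "2023-01": ["logistics", "education", "education"],
    "2023-02": ["logistics", "innovation"],
    "2023-03": ["logistics", "innovation", "innovation"],
}
scores = tfidf_by_window(mentions)
print(scores["2023-01"])
```

With the plain inverse document frequency, a topic present in every window (here "logistics") scores zero, so a heatmap of persistent topics needs a smoothed or otherwise adapted weighting.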
To capture macro-level shifts and the strategic evolution, the prominence of
high-level goals and hypernyms was aggregated and visualized by month across
the entire study period. The corresponding heatmap is presented in Fig. 6.
Fig. 6. TF-IDF for the long-term ontology
“Education” and “Attacks on Infrastructure Facilities” have decreased in
frequency. Meanwhile, topics related to “Innovation” and “Defense” demonstrate a
sustained and increasing frequency of mention, indicating a long-term strategic
interest. Core operational topics, such as “Military Activity” and “Logistics”, remain
consistently present throughout the timeline.
To investigate micro-level volatility and detect the immediate impact of
events, the analysis was replicated with data aggregated by day. The corresponding
daily heatmap is presented in Fig. 7.
Topics concerning Military Activity, Modernization, and Technology
exhibit stable, high-frequency mentions across the daily periods. In contrast,
other goal-related topics that spike following specific events tend to gradually
fade away.
Fig. 7. TF-IDF for the short-term ontology
CONCLUSIONS
This study proposed and validated a robust approach for automated ontology
construction based on large language models (LLMs), together with temporal
analysis in different time frames, and applied it to the domain of communication
technologies. The approach was developed and tested on multilingual social media
data collected from the “Victory Drones” Telegram channel [11] over the period from
October 2022 to September 2025.
The dataset was gathered using asynchronous distributed parsing methods
implemented with advanced Python libraries, ensuring efficient and reliable
extraction of posts from large-scale Telegram data. After filtering out irrelevant
content such as channel information and advertising, the final corpus provided a
representative record of the thematic domain. Temporal aggregation was performed
at both daily and hourly resolutions, enabling the comparison of long-term and
short-term semantic dynamics.
The extraction process relied on multiple LLM configurations to identify goal
statements and their corresponding hypernyms from raw texts. A consensus
mechanism ensured robustness by considering only those semantic pairs that were
consistently reproduced across several LLMs, minimizing hallucination risk.
The extracted semantic elements were embedded into a high-dimensional
vector space, where similarity was computed using cosine distance. Clustering and
hierarchical merging were performed iteratively, with the optimal number of
clusters determined via an optimization criterion. A key empirical finding is that
convergence (the point at which further iterations cease to produce meaningful new
clusters) occurred consistently after five iterations for both the daily (long-term)
and hourly (short-term) ontologies. This stability suggests that, although the
temporal resolutions differ, the underlying semantic data have a highly similar
structural organization. Of the 886 distinct thematic concepts identified across the
two ontologies, 262 (29.6 %) formed the shared semantic core, appearing in both
hierarchies.
The daily (long-term) ontology contributed 405 unique topics, capturing
slow-evolving, structural themes. In contrast, the hourly (short-term) ontology
introduced 219 unique topics, reflecting rapid signals. This comparison reveals that
some topics emerge briefly within short intervals and then fade, capturing real-time
fluctuations in public attention. However, when observed over longer periods,
certain topics demonstrate persistence, reappearing across multiple temporal windows
and forming the backbone of the long-term semantic structure.
The temporal analysis component turned the static ontology into a
dynamic tracking tool via the TF-IDF approach. Three key patterns emerged:
topics that rise quickly over a short-term interval tend to fade rapidly in
prominence; some strategic topics show strong, stable, or increasing prominence in
the long-term analysis; and some topics show stable, high-frequency occurrence
across both the short-term and long-term frames.
The proposed approach demonstrates that LLM-driven ontology construction
can effectively reproduce some of the analysis performed by domain experts, such as
identifying goals, abstracting hypernyms, and structuring thematic relations. This
makes the method highly cost-efficient and scalable.
In summary, the research shows that the semantic ontologies obtained from
LLM-based analysis can provide a stable and interpretable representation over
time. The observed convergence behavior, structural similarities, and interpretable
divergences between long-term and short-term perspectives validate the robustness
of the proposed framework, which provides a foundation for future automated
foresight systems.
REFERENCES
1. M. Zgurovsky, N. Pankratova, System Analysis & Intelligent Computing, Theory and Applications. Berlin: Springer, 2022, 432 p. doi: http://doi.org/10.1007/978-3-030-94910-5
2. A. Rosa, N. Gudowsky, P. Repo, “Sensemaking and lens-shaping: Identifying citizen contributions to foresight through comparative topic modelling,” Futures, vol. 129, pp. 1–15, 2021. doi: http://doi.org/10.1016/j.futures.2021.102733
3. C. Mühlroth, M. Grottke, “Artificial Intelligence in Innovation: How to Spot Emerging Trends and Technologies,” IEEE Transactions on Engineering Management, vol. 69, no. 2, pp. 493–510, April 2022. doi: 10.1109/TEM.2020.2989214
4. Y. Kishita, T. Kusaka, Y. Mizuno, Y. Umeda, “Toward theory development in futures and foresight by drawing on design theory: A commentary on Fergnani and Chermack 2021,” Futures & Foresight Science, vol. 3, issue 3-4, 2021, pp. 1–3. doi: https://doi.org/10.1002/FFO2.91
5. O. Matei, R. Erdei, D. Delinschi, “Multimodal transportation overview and optimization ontology for a greener future,” Artificial Intelligence in Intelligent Systems: Proceedings of 10th Computer Science On-line Conference, vol. 2, pp. 158–172. Springer 2021. doi: https://doi.org/10.1007/978-3-030-77445-5_15
6. Y. Chen, S. Sabri, A. Rajabifard, M. Agunbiade, “An ontology-based spatial data harmonisation for urban analytics,” Computers, Environment and Urban Systems, vol. 72, pp. 177–190. Elsevier, 2018. doi: https://doi.org/10.1016/j.compenvurbsys.2018.06.009
7. T. Brown et al., “Language Models are Few-Shot Learners,” arXiv preprint, 75 p., 2020. Available: https://arxiv.org/abs/2005.14165
8. J. Achiam et al., “Gpt-4 technical report,” arXiv preprint, 100 p., 2023. Available: https://arxiv.org/abs/2303.08774
9. Gemini Team Google: Rohan Anil et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint, 90 p., 2025. Available: https://arxiv.org/abs/2312.11805
10. xAI. Grok 3 beta - the age of reasoning agents. Available: https://x.ai/news/grok-3/
11. Victory Drones [Telegram channel], 2022–2025. Available: https://t.me/VictoryDrones
12. Y. Chen, X. Pan, Y. Li, B. Ding, J. Zhou, “EE-LLM: Large-scale training and inference of early-exit large language models with 3D parallelism,” arXiv preprint, 27 p., 2024. Available: https://arxiv.org/abs/2312.04916
13. O. Michel, R. Bifulco, G. Retvari, S. Schmid, “The Programmable Data Plane: Abstractions, Architectures, Algorithms, and Applications,” Proc. ACM Computing Surveys (CSUR), vol. 54, issue 4, pp. 1–36, 2021. doi: https://doi.org/10.1145/3447868
14. T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020. doi: https://doi.org/10.18653/v1/2020.emnlp-demos.6
15. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, “Pretrain, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023. doi: https://doi.org/10.1145/3560815
16. E. Yu et al., “Merlin: Empowering Multimodal LLMs with Foresight Minds,” arXiv preprint, 28 p., 2023. doi: https://doi.org/10.48550/arXiv.2312.00589
17. N. Muennighoff, N. Tazi, L. Magne, N. Reimers, “MTEB: Massive Text Embedding Benchmark,” Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Dubrovnik, Croatia, 2023, pp. 2014–2037. doi: https://doi.org/10.18653/v1/2023.eacl-main.148
18. A. Lucky, T. Kartik, B. Gaurav, M. Ankush, “Authorship Clustering using TF-IDF weighted Word-Embeddings,” Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation (FIRE 19). Association for Computing Machinery, New York, NY, USA, 2019, pp. 24–29. doi: https://doi.org/10.1145/3368567.3368572
Received 15.12.2025
INFORMATION ON THE ARTICLE
Serhii A. Lupenko, ORCID: 0000-0002-6559-0721, Opole University of Technology,
Poland, e-mail: lupenko.san@gmail.com
Mykhailo V. Stoliar, ORCID: 0009-0009-3624-3147, Educational and Research Institute
for Applied System Analysis of the National Technical University of Ukraine “Igor
Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: misha.stolyar99@gamil.com
Oleksandr M. Terentiev, ORCID: 0000-0002-4288-1753, Educational and Research
Institute for Applied System Analysis of the National Technical University of Ukraine “Igor
Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: o.terentiev@gmail.com
Volodymyr V. Savastiyanov, ORCID: 0000-0002-2052-0420, Educational and Research
Institute for Applied System Analysis of the National Technical University of Ukraine
“Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: vvs.in.ua@gmail.com
AUTOMATED CONSTRUCTION OF A SEMANTIC ONTOLOGY FOR FORESIGHT
STUDIES USING LARGE LANGUAGE MODELS / S.A. Lupenko, M.V. Stoliar,
O.M. Terentiev, V.V. Savastiyanov
Abstract. Recent advances in large language models (LLMs) make it possible to
automatically detect semantic structures and new signals present in streams of
textual information. This makes it possible to automate routine workflows
associated with developing forecasting models based on continuous data analysis
systems. The aim of the study is to develop and validate an automated framework
for extracting, structuring, and comparing semantic ontologies using LLMs.
Process parallelization was used to analyze data from various social media
platforms. The data were first filtered, namely, items that do not belong to the
subject domain under study were removed. The key semantic elements, goals and
hypernyms corresponding to the subject domain, were extracted using several LLM
configurations with a consensus mechanism to ensure semantic reliability and to
minimize hallucinations and fabrication of facts by the LLMs. The extracted
elements were represented in a multidimensional vector space, iteratively
clustered using the cosine similarity metric, and merged hierarchically. The
convergence process and structural stability were analyzed using the elbow
criterion and similarity metrics. The proposed approach is a cost-efficient
alternative to traditional expert-based foresight analysis. By combining
LLM-driven semantic extraction with quantitative clustering, the method makes it
possible to identify new trends, weak signals, and long-term thematic structures.
The results obtained highlight the great potential of LLM-based semantic modeling
as a foundation for automated forecasting systems.
Keywords: foresight, large language models, semantic ontology, scenario
analysis, weak signals, hierarchical clustering.
|
| id | journaliasakpiua-article-365265 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2026-07-01T01:00:18Z |
| publishDate | 2026 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/83/b22ecb7ed7c105c96e5d77066ae0eb83.pdf |
| spelling | journaliasakpiua-article-3652652026-06-30T06:14:59Z Automated semantic ontology construction for foresight studies using large language models Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей Lupenko, Serhii Stoliar, Mykhailo Terentiev, Oleksandr Savastiyanov, Volodymyr передбачення великі мовні моделі семантична онтологія сценарний аналіз слабкі сигнали ієрархічна кластеризація foresight large language models semantic ontology scenario analysis weak signals hierarchical clustering Recent advances in large language models (LLMs) enable the automated discovery of semantic structures and emerging signals within text streams, offering an opportunity to redesign foresight workflows into continuous, data-driven systems. This study aims to develop and validate an automated framework for extracting, structuring, and comparing semantic ontologies using LLMs. The paralyzed approach was used for data mining from social media platforms and filtering non-domain data. The key semantic elements, goals and hypernyms corresponded, were extracted using multiple LLM configurations, with a consensus mechanism to provide semantic reliability and minimize hallucination. The extracted elements were embedded in a high-dimensional vector space, clustered iteratively using cosine similarity, and merged hierarchically. Convergence process and structural stability were analyzed using the elbow criterion and similarity metrics. The Proposed approach provides a cost-efficient alternative to traditional expert-based foresight analysis. By integrating LLM-driven semantic extraction with quantitative clustering, it enables the identification of emerging trends, weak signals, and long-term thematic structures. The results highlight the potential of LLM-based semantic modeling as a foundation for automated foresight systems. 
Сучасні досягнення у сфері великих мовних моделей (LLM) дають змогу автоматизовано виявляти семантичні структури та нові сигнали, які наявні в потоках текстової інформації. Це дає змогу автоматизувати рутинні робочі процеси, які пов’язані із розробленням прогнозних моделей на основі систем безперервного аналізу даних. Мета дослідження – розроблення і валідація автоматизованої схеми для вилучення, структурування та порівняння семантичних онтологій за допомогою LLM. Для аналізу даних із різноманітних платформ соціальних мереж використано паралізацію процесів. Дані спочатку відфільтровано, а саме: вилучено ті, що не належать до предметної досліджуваної галузі. Ключові семантичні елементи, цілі та гіпероніми, що відповідають предметній галузі, вилучено за допомогою кількох конфігурацій LLM із механізмом консенсусу для забезпечення семантичної надійності та мінімізації галюцинацій та вигадувань фактів зі сторони LLM. Вилучені елементи представлено у багатовимірному векторному просторі, ітеративно кластеризовано за допомогою метрики косинусної подібності та ієрархічно об’єднано. Процес конвергенції та структурну стабільність проаналізовано за допомогою критерію ліктя та метрик подібності. Запропонований підхід – економічно ефективна альтернатива традиційному експертному аналізу прогнозування. Об’єднуючи воєдино семантичне вилучення, кероване LLM із кількісною кластеризацією, цей метод дозволяє ідентифікувати нові тенденції, слабкі сигнали та довгострокові тематичні структури. Отримано результати дослідження, які підкреслюють великий потенціал семантичного моделювання на основі LLM як основи для автоматизованих систем прогнозування. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2026-06-30 Article Article Peer-reviewed Article application/pdf https://journal.iasa.kpi.ua/article/view/365265 10.20535/SRIT.2308-8893.2026.2.09 System research and information technologies; No. 
2 (2026); 134-152 Системные исследования и информационные технологии; № 2 (2026); 134-152 Системні дослідження та інформаційні технології; № 2 (2026); 134-152 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/365265/350714 |
| spellingShingle | передбачення великі мовні моделі семантична онтологія сценарний аналіз слабкі сигнали ієрархічна кластеризація Lupenko, Serhii Stoliar, Mykhailo Terentiev, Oleksandr Savastiyanov, Volodymyr Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title | Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title_alt | Automated semantic ontology construction for foresight studies using large language models |
| title_full | Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title_fullStr | Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title_full_unstemmed | Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title_short | Автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| title_sort | автоматизована конструкція семантичної онтології для досліджень передбачення із використанням великих лінгвістичних моделей |
| topic | передбачення великі мовні моделі семантична онтологія сценарний аналіз слабкі сигнали ієрархічна кластеризація |
| topic_facet | передбачення великі мовні моделі семантична онтологія сценарний аналіз слабкі сигнали ієрархічна кластеризація foresight large language models semantic ontology scenario analysis weak signals hierarchical clustering |
| url | https://journal.iasa.kpi.ua/article/view/365265 |
| work_keys_str_mv | AT lupenkoserhii automatedsemanticontologyconstructionforforesightstudiesusinglargelanguagemodels AT stoliarmykhailo automatedsemanticontologyconstructionforforesightstudiesusinglargelanguagemodels AT terentievoleksandr automatedsemanticontologyconstructionforforesightstudiesusinglargelanguagemodels AT savastiyanovvolodymyr automatedsemanticontologyconstructionforforesightstudiesusinglargelanguagemodels AT lupenkoserhii avtomatizovanakonstrukcíâsemantičnoíontologíídlâdoslídženʹperedbačennâízvikoristannâmvelikihlíngvístičnihmodelej AT stoliarmykhailo avtomatizovanakonstrukcíâsemantičnoíontologíídlâdoslídženʹperedbačennâízvikoristannâmvelikihlíngvístičnihmodelej AT terentievoleksandr avtomatizovanakonstrukcíâsemantičnoíontologíídlâdoslídženʹperedbačennâízvikoristannâmvelikihlíngvístičnihmodelej AT savastiyanovvolodymyr avtomatizovanakonstrukcíâsemantičnoíontologíídlâdoslídženʹperedbačennâízvikoristannâmvelikihlíngvístičnihmodelej |