Statistical methods of feature engineering for the problem of forest state classification using satellite data
Timely detection of forest diseases is important for preventing them and limiting their spread. Satellite imagery enables large-scale forest monitoring. Machine learning models make it possible to automate the analysis of these data to detect anomalies that indicate diseases....
| Date: | 2024 |
|---|---|
| Main authors: | Salii, Yevhenii; Lavreniuk, Alla; Kussul, Nataliia |
| Format: | Article |
| Language: | English |
| Published: | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", 2024 |
| Online access: | https://journal.iasa.kpi.ua/article/view/286178 |
| Journal title: | System research and information technologies |
Institution: System research and information technologies

| _version_ | 1867334439008206848 |
|---|---|
| author | Salii, Yevhenii Lavreniuk, Alla Kussul, Nataliia |
| author_facet | Salii, Yevhenii Lavreniuk, Alla Kussul, Nataliia |
| author_institution_txt_mv | [
{
"author": "Yevhenii Salii",
"institution": "Educational and Research Institute of Physics and Technology of the National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv"
},
{
"author": "Alla Lavreniuk",
"institution": "Educational and Research Institute of Physics and Technology of the National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv"
},
{
"author": "Nataliia Kussul",
"institution": "Educational and Research Institute of Physics and Technology of the National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv"
}
] |
| author_sort | Salii, Yevhenii |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2024-05-23T07:09:36Z |
| description | Timely detection of forest diseases is important for preventing them and limiting their spread. Satellite imagery enables large-scale forest monitoring. Machine learning models make it possible to automate the analysis of these data to detect anomalies that indicate diseases. However, selecting informative features is key to building an effective model. In this work, the application of the Bhattacharyya distance and Spearman’s rank correlation coefficient to feature selection from satellite images was investigated. A greedy algorithm was applied to form a subset of weakly correlated features. The experiment showed that the selected features improve classification quality compared to using all spectral bands. The proposed approach is effective for selecting informative and weakly correlated features and can be used in other remote sensing tasks. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2024.1.07 |
| first_indexed | 2025-07-17T10:28:20Z |
| format | Article |
| fulltext |
Y.V. Salii, A.M. Lavreniuk, N.M. Kussul, 2024
86 ISSN 1681–6048 System Research & Information Technologies, 2024, № 1
UDC 004.2
DOI: 10.20535/SRIT.2308-8893.2024.1.07
STATISTICAL METHODS OF FEATURE ENGINEERING
FOR THE PROBLEM OF FOREST STATE CLASSIFICATION
USING SATELLITE DATA
Y.V. SALII, A.M. LAVRENIUK, N.M. KUSSUL
Abstract. Timely detection of forest diseases is important for preventing them
and limiting their spread. Satellite imagery enables large-scale forest
monitoring. Machine learning models make it possible to automate the analysis of
these data to detect anomalies that indicate diseases. However, selecting
informative features is key to building an effective model. In this work, the
application of the Bhattacharyya distance and Spearman’s rank correlation
coefficient to feature selection from satellite images was investigated. A greedy
algorithm was applied to form a subset of weakly correlated features. The
experiment showed that the selected features improve classification quality
compared to using all spectral bands. The proposed approach is effective for
selecting informative and weakly correlated features and can be used in other
remote sensing tasks.
Keywords: Sentinel-2, vegetation indices, Bhattacharyya distance, feature
engineering, greedy algorithms, Spearman’s rank correlation coefficient.
INTRODUCTION
Monitoring the condition of forest areas is essential for identifying tree
diseases and preventing their further spread. High-resolution satellite images
make it possible to regularly obtain up-to-date information on large forest
areas [1]. Processing and analysis of these data can be automated using machine
learning methods that detect signs of abnormal vegetation changes that may
indicate the presence of diseases [2; 3].
One of the key steps in building an effective machine learning model for
classifying forest condition is the careful selection of the most informative
features of the input data. This makes it possible to simplify the model and
reduce the training time without losing classification quality. There are many
approaches to evaluating the informativeness of features. Most of them are
statistical and based on evaluating the similarity of data distributions. Among
the best-known methods for assessing the similarity of distributions is the
Kullback–Leibler divergence, which calculates the relative entropy between two
probability distributions. The higher the divergence value, the more distinct
the distributions are [4]. However, this measure is not symmetric, which limits
its use. More universal are methods that determine the distance between data
distributions. These include the Euclidean metric, which calculates the
Euclidean distance between the means of two distributions; the Wasserstein
distance, which measures the minimum “work” required to transform one
distribution into another; and the chi-square distance, which compares the
frequencies of samples from two distributions. Another symmetric metric that
measures the difference between two distributions is the Bhattacharyya
distance [5].
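The contrast between the asymmetric Kullback–Leibler divergence and the symmetric Bhattacharyya distance can be illustrated with a small sketch. The two three-bin distributions below are made-up values chosen only for illustration, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: -ln of the sum of sqrt(p_i * q_i)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)


# Two toy distributions over three bins (illustrative values only).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

kl_pq = kl_divergence(p, q)            # changes when the arguments are swapped
kl_qp = kl_divergence(q, p)
db_pq = bhattacharyya_distance(p, q)   # unchanged when the arguments are swapped
db_qp = bhattacharyya_distance(q, p)

print(f"KL(p||q)={kl_pq:.4f}  KL(q||p)={kl_qp:.4f}")
print(f"D_B(p,q)={db_pq:.4f}  D_B(q,p)={db_qp:.4f}")
```

Swapping the arguments changes the divergence but not the distance, which is the property that makes a symmetric measure convenient for comparing two class distributions.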
This paper investigates the possibility of using the Bhattacharyya distance
and the Spearman correlation coefficient to select the most informative and at the
same time weakly correlated features of multispectral satellite images.
FORMULATION OF THE PROBLEM
Let us consider the task of detecting disease in forest areas based on the
analysis of Sentinel-2 satellite images [6]. The goal of our research is to
develop an effective model that can automatically determine whether a certain
area of the forest is diseased on the basis of multispectral satellite data
acquired at different times.
To achieve this goal, two multispectral images will be used: current (Fig. 1,
a) and past (Fig. 1, b). Since coniferous forests were studied, past images are
not limited to a specific date, but it is important that the forest is healthy
in them. Additionally, a vector mask of forest type (Fig. 1, c) from the Forest
Type 2018 geospatial dataset [6] is used, which allows us to determine the areas
where the forest is located and exclude non-forest areas from the analysis.
In machine learning terms, the task is to build and train a binary classifier
that assigns each pixel of a multispectral image to the class “healthy” or
“stressed” (Fig. 1, d). For training the model and evaluating its
effectiveness, experts provided a ground truth (disease) mask (Fig. 1, d).
The experimental study was conducted on the data of the eastern part of the
Grand Est region, France. The area for which training data is available is shown
in Fig. 2.
Fig. 1. Example of input data: (a) current image and (b) past image, shown as
RGB (B4, B3, B2) composites of Sentinel-2 images; (c) Forest Type 2018 mask
(white: coniferous, gray: deciduous, black: non-forest); (d) ground truth mask
(white: sick, black: healthy)
SATELLITE DATA
The work uses multispectral images of the Sentinel-2 satellite obtained in the
eastern part of France. The images contain data in 13 spectral channels (bands)
with a spatial resolution of 10 to 60 meters. Images of areas of coniferous forests
were selected for analysis. All channels with a resolution of 10 m (Fig. 3, a) and
20 m (Fig. 3, b) were used, as well as channel B9 (water vapor) with a resolution
of 60 m (Fig. 3, c). This choice is due to the fact that these ranges are sensitive to
the content of chlorophyll, moisture and other indicators of vegetation, the change
of which may indicate the presence of diseases.
The choice of bands for research is determined by the following considerations.
The B4 band (red range) is sensitive to the chlorophyll content of vegetation
because chlorophyll strongly absorbs red light for photosynthesis. A decrease
in the content of chlorophyll during plant stress or disease leads to an
increase in reflection in the red range. The B5 band (red-edge) is in the region
of rapid change in reflectance from low (red light) to high (infrared). Changes
in this band can carry information about the content of chlorophyll, which often
changes during the development of the disease. The B9 band (water vapor) is used
mainly for atmospheric correction, but can help in cases of changes in the water
content of vegetation under stress. Bands B11 and B12 (short-wave infrared
range) are sensitive to the water content of plants because water absorbs
strongly in this range. A decrease in water content under stress leads to an
increase in reflectance, which may indicate the development of the disease.
In addition to the values of the spectral bands themselves, their combinations
(so-called vegetation indices) will also be used as input data. Depending on the
mathematical form of the index, they can highlight information about the state
of the vegetation cover or eliminate or minimize the influence of negative
factors (for example, brightness).
Fig. 2. Areas for which training data are available. The locations of the areas
are marked with black dots
Among the well-known vegetation indices, the following can be noted:
Green Leaf Index (GLI): used to assess the health and development of green
leaves of the plant cover and its physiological state, to detect stress, drying
or damage of plants, and to monitor their growth and phenological changes.
Normalized Difference Vegetation Index (NDVI): measures the health and density
of vegetation on the Earth’s surface. This index is used to assess the state of
ecosystems, monitor the impact of climate change and control land use.
Disease Water Stress Index (DSWI): used to identify plant diseases, especially
in coniferous forests. A decrease in DSWI values indicates a deterioration of
the physiological state of the plant cover, which can be caused by infectious
diseases or stressful conditions.
Chlorophyll Vegetation Index (CVI): used to estimate the concentration of
chlorophyll in the vegetation cover.
Fig. 3. Bands of Sentinel-2 satellite images [7], grouped by spatial resolution:
10 m (a), 20 m (b), 60 m (c)
Since vegetation indices are mathematical functions, they can be generalized
by classes of functions. For example, the following classes can be distinguished
for the vegetation indices used in the article [2, Table 2]:
$B(A) = A$: B2, B3, B4, …, B9, B11, B12;
$\mathrm{NORMP}(A, B) = \frac{A - B}{A + B}$: NDWI, NGRDI, NDRE2, NDVI, GNDVI, NDRE3;
$\mathrm{FRAC}(A, B) = \frac{A}{B}$: RDI, PBI, CIG;
$\mathrm{GLIbased}(A, B, C) = \frac{(A - B) + (A - C)}{(A + B) + (A + C)}$: GLI;
$\mathrm{NORPP}(A, B, C, D) = \frac{A + B}{C + D}$: DSWI;
$\mathrm{CVIbased}(A, B, C, D) = \frac{A \cdot B}{C \cdot D}$: CVI;
$\mathrm{DIST}(A, B) = \sqrt{A^2 + B^2}$: DRS.
The set of values of the spectral bands and the possible vegetation indices will
be used as input data for building and training models that classify the state
of the forest as healthy or stressed.
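As a minimal sketch, the function classes can be written as small Python functions and instantiated with concrete bands. The functional forms follow the standard definitions of the listed indices (for example, NDVI as a normalized difference and GLI as (2G - R - B)/(2G + R + B)); the dictionary name and the pixel reflectances are hypothetical and serve only as an illustration.

```python
import math

# Function classes; A, B, C, D stand for band reflectances.
INDEX_CLASSES = {
    "B": lambda a: a,
    "NORMP": lambda a, b: (a - b) / (a + b),
    "FRAC": lambda a, b: a / b,
    "GLIbased": lambda a, b, c: ((a - b) + (a - c)) / ((a + b) + (a + c)),
    "NORPP": lambda a, b, c, d: (a + b) / (c + d),
    "CVIbased": lambda a, b, c, d: (a * b) / (c * d),
    "DIST": lambda a, b: math.sqrt(a ** 2 + b ** 2),
}

# Hypothetical reflectances for one pixel (made-up numbers for illustration).
pixel = {"B2": 0.03, "B3": 0.05, "B4": 0.04, "B8A": 0.30}

# NDVI is the NORMP instance with A = B8A and B = B4.
ndvi = INDEX_CLASSES["NORMP"](pixel["B8A"], pixel["B4"])
# GLI is the GLIbased instance with A = B3, B = B4, C = B2.
gli = INDEX_CLASSES["GLIbased"](pixel["B3"], pixel["B4"], pixel["B2"])

print(f"NDVI = {ndvi:.4f}, GLI = {gli:.4f}")
```

Treating an index as a class plus a tuple of band arguments is what allows new instances to be generated systematically rather than taken only from the literature.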
METHODOLOGY
Bhattacharyya distance calculation
To evaluate the informativeness of the features, the Bhattacharyya distance
between the distributions of the feature values for the healthy and stressed
forest classes will be used. The larger the value of this distance, the better
the feature separates these classes.
The Bhattacharyya distance is calculated by the formula:
$$D_B(H, S) = -\ln(BC(H, S)),$$
where $BC(H, S) = \sum_{i=1}^{n} \sqrt{H_i S_i}$ is the Bhattacharyya
coefficient, and $H_i$, $S_i$ are the probabilities of the i-th value (the
heights of the i-th columns of the histograms $H$ and $S$).
The estimate of the Bhattacharyya coefficient is sensitive to the number of
histogram columns. If there are too few columns, the histograms overlap more, so
the coefficient is overestimated and the distance underestimated; if there are
too many, the coefficient is underestimated and the distance overestimated.
Because of this, it is reasonable to take the number of histogram columns equal
to the square root of the number of observations of the stressed class.
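A minimal sketch of this calculation in plain Python follows. It assumes equal-width bins, a histogram range taken from the stressed-class values (as the experiment section describes), and healthy-class values outside that range simply dropped before normalization; the last two points and all function names are assumptions made for illustration, and the samples are synthetic.

```python
import math
import random


def histogram(values, lo, hi, n_bins):
    """Relative frequencies of the values that fall inside [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def bhattacharyya_distance(healthy, stressed):
    """D_B = -ln(sum_i sqrt(H_i * S_i)) over a shared set of histogram bins."""
    # Number of bins: square root of the number of stressed-class observations.
    n_bins = max(1, round(math.sqrt(len(stressed))))
    lo, hi = min(stressed), max(stressed)
    h = histogram(healthy, lo, hi, n_bins)
    s = histogram(stressed, lo, hi, n_bins)
    bc = sum(math.sqrt(h_i * s_i) for h_i, s_i in zip(h, s))
    return -math.log(bc) if bc > 0 else math.inf


# Synthetic feature values (not data from the study).
random.seed(0)
healthy = [random.gauss(0.30, 0.05) for _ in range(2000)]
stressed_far = [random.gauss(0.45, 0.05) for _ in range(400)]
stressed_near = [random.gauss(0.31, 0.05) for _ in range(400)]

d_far = bhattacharyya_distance(healthy, stressed_far)
d_near = bhattacharyya_distance(healthy, stressed_near)
print(f"well separated: {d_far:.3f}, nearly identical: {d_near:.3f}")
```

A feature whose class distributions barely overlap gets a larger distance than one whose distributions nearly coincide, which is what makes the distance usable as an informativeness score.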
Greedy feature selection algorithm
In the previous section, classes of functions were presented, on the basis of
which a large set of vegetation indices can be created. In this section, the
most relevant features from this variety of indices will be defined. Our task is
to select a subset of features that jointly satisfy two key criteria. First,
they should be distinguished by high values of the Bhattacharyya distance that
characterizes their significance. Secondly, these features should be as
independent as possible, i.e. carry as much new information as possible.
This approach is aimed at building a compact and at the same time informative
subset of features that helps to understand key relationships and features in
the data. This selection of features will simplify the machine learning model
and increase its effectiveness.
A greedy algorithm will be used to form an optimal set of informative and
weakly correlated features. At each step of the work, greedy algorithms choose a
locally optimal solution, which makes it possible to obtain a satisfactory global
approximation in an acceptable time [8]. Heuristics based on the Bhattacharyya
distance and the Spearman correlation coefficient will be used as a criterion for
local optimality of the greedy algorithm.
Spearman correlation coefficient
The Spearman correlation coefficient [9] is used to determine the statistical
dependence of two variables and shows to what extent the dependence between them
can be described by a monotonic function. It is defined as the Pearson
correlation coefficient of the variable ranks, that is, it operates not on the
values themselves but on their ordinal positions. Its values lie within [–1, 1].
This value is defined as
$$r_s(X, Y) = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma(R(X))\,\sigma(R(Y))},$$
where $X$, $Y$ are the values of the variables; $R(X)$, $R(Y)$ are the values
transformed into their ranks; $\operatorname{cov}(R(X), R(Y))$ is the covariance
of the rank values; $\sigma(R(X))$, $\sigma(R(Y))$ are the standard deviations
of the corresponding rank values.
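The definition can be sketched directly: rank both variables, then apply the Pearson formula to the ranks. The helper names below are hypothetical, and ties are given the mean of their positions, which is a common convention assumed here rather than something stated in the text.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_monotone = [v ** 3 for v in x]   # nonlinear but strictly increasing
y_reversed = [-v for v in x]       # strictly decreasing

r_up = spearman(x, y_monotone)     # close to +1
r_down = spearman(x, y_reversed)   # close to -1
print(r_up, r_down)
```

Because only ranks enter the formula, any strictly monotonic relation yields a coefficient of magnitude 1 (up to rounding), even when the relation is far from linear.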
Independence coefficient
Since the goal is to find the most independent features, a value of 1 should
correspond to independent features, and 0 should correspond to fully dependent
features. Thus, the coefficient of independence takes the form
$c_i(X, Y) = 1 - |r_s(X, Y)|$, where $r_s(X, Y)$ is the Spearman correlation
coefficient.
Proposed algorithm
The algorithm works as follows:
1. The weight of each feature is set equal to the value of the Bhattacharyya
distance of the given feature.
2. The feature with the largest current weight is selected.
3. The weight of each feature is multiplied by the independence coefficient
between the current feature and the feature selected in the previous step.
4. Return to step 2 until the required number of features has been selected.
The pseudocode of the algorithm can be seen in Fig. 4. It is important to note
that it uses a modified coefficient of independence, namely, a constant value
$C$ is added to it. This change makes it possible to adjust what the algorithm
pays more attention to: the independence of features ($C \to 0$) or the
informativeness of features ($C \gg 0$).
This approach consistently adds to the set the feature that is currently the
most informative and weakly correlated with the previously selected ones. As a
result, a set is formed that contains a maximum of information for a given
number of features.
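The four steps can be sketched in Python as follows. This is not the authors' pseudocode from Fig. 4 but an illustration under stated assumptions: a selected feature is removed from the candidate pool, the independence coefficient is supplied as a function, and the feature names, distances and coefficients in the example are made up.

```python
def greedy_select(distances, independence, n_select, c=0.0):
    """Greedy selection of informative, weakly correlated features.

    distances:    dict mapping feature name -> Bhattacharyya distance
    independence: function (name_a, name_b) -> coefficient in [0, 1]
    c:            constant added to the independence coefficient
    """
    weights = dict(distances)                    # step 1: initial weights
    selected = []
    while weights and len(selected) < n_select:  # step 4: repeat
        best = max(weights, key=weights.get)     # step 2: largest weight
        selected.append(best)
        del weights[best]                        # assumption: no re-selection
        for name in weights:                     # step 3: damp correlated ones
            weights[name] *= independence(name, best) + c
    return selected


# Made-up example: f2 is almost a copy of f1, f3 is weaker but independent.
distances = {"f1": 0.60, "f2": 0.58, "f3": 0.20}
pairwise = {
    frozenset(("f1", "f2")): 0.05,
    frozenset(("f1", "f3")): 0.90,
    frozenset(("f2", "f3")): 0.90,
}


def independence(a, b):
    return pairwise[frozenset((a, b))]


favor_independence = greedy_select(distances, independence, 2, c=0.0)
favor_informativeness = greedy_select(distances, independence, 2, c=0.6)
print(favor_independence)      # ['f1', 'f3']
print(favor_informativeness)   # ['f1', 'f2']
```

In this toy case, a constant of 0 lets the independent but weaker feature win the second slot, while a constant of 0.6 softens the penalty enough for the informative but correlated feature to be chosen instead.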
Machine learning models for classification
To compare the effectiveness of different sets of features, machine learning
models will be used for binary classification of the forest state.
A multilayer perceptron [10] with a varying number of hidden layers will be used
as the basic classifier. The optimal architecture and hyperparameters of the
model will be tuned using a genetic algorithm.
The models will be trained on a training sample of satellite images with ground
truth masks. To assess the quality of the classification, the following metrics
[11] will be computed on the validation sample: overall accuracy (accuracy),
Jaccard index (IoU), cross-entropy (log-loss), and area under the ROC curve
(ROC AUC).
Note that the overall accuracy metric is meaningful only when the classes are
balanced, which was the case on the validation sample.
Five-fold cross-validation was used to analyze metrics.
Comparing the results of models built on different sets of features allows us
to assess the contribution of the proposed feature selection method to improving
the quality of classification.
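For reference, the four metrics can be computed for a binary problem with a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used in the study: the labels and probabilities are made up, the 0.5 threshold is an assumption, and IoU is computed for the positive ("stressed") class only.

```python
import math


def accuracy(y_true, y_pred):
    """Share of pixels whose predicted class matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def iou(y_true, y_pred, positive=1):
    """Jaccard index of the positive class: intersection over union."""
    inter = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    union = sum(t == positive or p == positive for t, p in zip(y_true, y_pred))
    return inter / union if union else 1.0


def log_loss(y_true, prob, eps=1e-12):
    """Mean binary cross-entropy of the predicted positive-class probability."""
    total = 0.0
    for t, p in zip(y_true, prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, prob) if t == 1]
    neg = [p for t, p in zip(y_true, prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = stressed) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
prob = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]
y_pred = [int(p >= 0.5) for p in prob]

acc = accuracy(y_true, y_pred)
jac = iou(y_true, y_pred)
ll = log_loss(y_true, prob)
auc = roc_auc(y_true, prob)
print(acc, jac, ll, auc)
```

Accuracy and IoU depend on the chosen threshold, whereas log-loss and ROC AUC are computed from the probabilities themselves, so the four metrics capture different aspects of classifier quality.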
EXPERIMENT RESULTS
Description and results of the experiment on evaluating the informativeness
of features
In order to evaluate the informativeness of various features of the image, the
Bhattacharyya distance between the classes of “stressed” and “healthy”
coniferous forest was calculated for various spectral channels and vegetation
indices.
Fig. 4. Pseudocode of the proposed algorithm
Sentinel-2 images containing fragments of healthy and stressed coniferous
forest in eastern France were used as input data. Each image contains 12 bands.
For each pixel of each image, 43,644 vegetation indices were calculated based on
12 spectral channels according to the given index classes.
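One way to arrive at the reported count of 43,644 is to substitute every ordered combination of the 12 bands, repeats included, into each of the seven function classes. The sketch below checks that arithmetic; the enumeration scheme is an assumption inferred from the total (the paper does not spell it out), and it includes degenerate instances such as NORMP(A, A).

```python
from itertools import product

# The 12 bands that appear in Table 1 (B10 is not used).
BANDS = ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B9", "B11", "B12"]

# Number of band arguments taken by each function class.
CLASS_ARITY = {
    "B": 1,
    "NORMP": 2,
    "FRAC": 2,
    "DIST": 2,
    "GLIbased": 3,
    "NORPP": 4,
    "CVIbased": 4,
}

# Every (class, argument tuple) pair, with repeated bands allowed.
candidates = [
    (name, args)
    for name, arity in CLASS_ARITY.items()
    for args in product(BANDS, repeat=arity)
]

total = len(candidates)
print(total)  # 43644 = 12 + 3 * 12**2 + 12**3 + 2 * 12**4
```

The two four-argument classes account for 41,472 of the candidates, which is consistent with the later observation that dropping the three- and four-argument classes shortens the selection step by an order of magnitude.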
The territories were divided into 3 parts, for each of which Bhattacharyya
distances were calculated separately between the obtained histograms of class
distributions. Histograms were built within the range of values of the
“stressed” class with the number of columns equal to the square root of the
number of pixels of this class. As a result, average estimates of the
Bhattacharyya distance, $\mathrm{Avg}(D_B(H, S))$, were obtained for each of the
features.
Among the original spectral channels, bands B4, B11 and B12 showed the greatest
informativeness (Table 1). Among the known vegetation indices, 7 (RDI, NDWI,
NGRDI, DSWI, NDRE2, NDVI, GLI) demonstrated higher separation rates (Table 1)
compared to spectral channels. A comparison of the Bhattacharyya distance values
of the vegetation indices from Table 1 with those of the corresponding classes
of indices (Table 2) shows that each class contains instances with a larger
distance value.
Table 1. Bhattacharyya distance
Sentinel-2 bands          Well-known vegetation indices
Band   Avg(D_B(H, S))     Index   Formula                                                  Avg(D_B(H, S))
B12    0.332              RDI     $B12 / B8A$                                              0.581
B4     0.306              NDWI    $(B8A - B11) / (B8A + B11)$                              0.577
B11    0.228              NGRDI   $(B3 - B4) / (B3 + B4)$                                  0.562
B1     0.116              DSWI    $(B8 + B3) / (B11 + B4)$                                 0.513
B9     0.115              NDRE2   $(B7 - B5) / (B7 + B5)$                                  0.470
B5     0.107              NDVI    $(B8A - B4) / (B8A + B4)$                                0.448
B7     0.092              GLI     $((B3 - B4) + (B3 - B2)) / ((B3 + B4) + (B3 + B2))$      0.389
B2     0.073              PBI     $B8 / B3$                                                0.219
B8A    0.068              CIG     $B8A / B3 - 1$                                           0.183
B6     0.068              GNDVI   $(B8A - B3) / (B8A + B3)$                                0.182
B8     0.062              NDRE3   $(B8A - B7) / (B8A + B7)$                                0.126
B3     0.041              CVI     $(B8A \cdot B5) / B3^2$                                  0.051
Table 2. The largest Bhattacharyya distance for classes of vegetation indices
Class                  Instance                                                   Avg(D_B(H, S))
CVIbased(A, B, C, D)   $(B3 \cdot B6) / (B11 \cdot B4)$                           0.686
NORPP(A, B, C, D)      $(B2 + B6) / (B12 + B4)$                                   0.646
GLIbased(A, B, C)      $((B6 - B4) + (B6 - B11)) / ((B6 + B4) + (B6 + B11))$      0.630
NORMP(A, B)            $(B6 - B12) / (B6 + B12)$                                  0.609
FRAC(A, B)             $B6 / B12$                                                 0.603
DIST(A, B)             $\sqrt{B12^2 + B4^2}$                                      0.354
B(A)                   $B12$                                                      0.331
So, the experiment confirmed that the Bhattacharyya distance can be used to
evaluate the informativeness of features, confirmed the advantage of using
vegetation indices over the original image data, and showed that for each class
of indices it is possible to find instances that are more informative (within
the scope of the task) than the well-known indices.
This makes it possible to form an effective set of features for further training
of machine learning models without having to train the models first.
Determination of the optimal set of informative features
The analysis of the results of feature selection showed that among the initial
set of 43,644 calculated vegetation indices, 3,128 had a Bhattacharyya distance
above 0.4, which indicates their high informativeness.
Using the proposed greedy algorithm, sets of 12 and 24 features were obtained.
The algorithm used a modified coefficient of independence, $c_i(X, Y) + C$,
where $C = 0.6$. At this value of the parameter $C$, the model showed the best
result.
As was found during the experiments, the use of the classes
$\mathrm{GLIbased}(A, B, C)$, $\mathrm{NORPP}(A, B, C, D)$ and
$\mathrm{CVIbased}(A, B, C, D)$ during selection leads to a decrease in metric
values, so they were excluded. Excluding them also reduces the execution time
of the algorithm by an order of magnitude.
Given this problem, it seems reasonable to search for an effective combination
within each class separately and then combine the results. However, identifying
the causes of this problem and the methods for solving it requires a separate
study.
Fig. 5 shows an example of the pairwise independence of the selected features,
where features are placed from top-left to bottom-right in descending order of
their individual informativeness. The brightness of each cell corresponds to the
independence of the pair: brighter means more independent.
As can be seen in Fig. 5, the obtained set of features mostly contains features
with relatively low informativeness. This indicates that although a large number
of features are informative, they are also highly correlated. This is also
confirmed by the fact that the selected features with high informativeness are
more correlated with each other (have a darker color) than the features with
low informativeness.
Analysis of machine learning results
To evaluate how successful the obtained sets are, let us compare the accuracy of
models built on them with that of models using only spectral channels, known
vegetation indices, and their combination.
Fig. 6 shows that the model built on the features proposed by the algorithm
achieves much better metric values and training dynamics. Already after about 7
epochs, the model based on the 12 proposed features shows metrics close to those
of the model based on known vegetation indices. This comparison confirms that
the proposed algorithm is effective.
It should also be noted that the model based on the 12 proposed features shows
metric values close to (but still slightly worse than) those of the model based
on spectral channels. At the same time, the algorithm ran an order of magnitude
longer than model training. Based on this, it can be concluded that the spectral
channels are a good enough basis and that the multilayer perceptron is able to
build a highly informative vegetation index from them.
However, when the number of features is doubled, the proposed algorithm finds a
set of features that improves both the accuracy of the model and the rate of its
learning. This suggests that the earlier feature sets did not maximize the
amount of information, contrary to what one might expect, given that the model
based on a combination of known vegetation indices and spectral channels showed
almost the same accuracy and learning rate as the model based on spectral
channels alone. It also confirms the usefulness and necessity of feature
engineering.
Fig. 5. Independence matrix for (a) — 12, (b) — 24 features selected by the proposed
algorithm among the classes NORMP(A, B), FRAC(A, B), DIST(A, B)
In the future, the proposed algorithm can be applied to optimize forest state
classification models based on other types of data and for other tasks of remote
sensing of the Earth.
CONCLUSIONS
In this work, the possibility of using Bhattacharyya distance to assess the relative
importance of features in the task of forest state classification based on satellite
images was investigated.
The analysis of real data showed that it makes sense to consider not specific
(known) vegetation indices but their classes. It was confirmed that within each
class a vegetation index can, with high probability, be found that is more
informative than the known indices and the original Sentinel-2 spectral
channels.
The proposed greedy feature selection algorithm based on the Bhattacharyya
distance and the Spearman correlation coefficient made it possible to form a set
of 12 features with similar accuracy indicators and a set of 24 features with
significantly better ones, compared to the model based only on the spectral
channels of the image.
Therefore, the proposed approach is effective for selecting informative and
weakly correlated features based on satellite images. It can be applied to find
an effective set of features for building machine learning models in forest
condition monitoring tasks and other fields of Earth remote sensing data
analysis without the need to pre-train the models.
Fig. 6. Metrics of built models on the validation sample
REFERENCES
1. N. Kussul, G. Lemoine, J. Gallego, S. Skakun, and M. Lavreniuk, “Parcel based
classification for agricultural mapping and monitoring using multi-temporal satellite
image sequences,” 2015 IEEE International Geoscience and Remote Sensing
Symposium (IGARSS), IEEE, 2015. doi: 10.1109/igarss.2015.7325725.
2. J. Zhang, S. Cong, G. Zhang, Y. Ma, Y. Zhang, and J. Huang, “Detecting Pest-
Infested Forest Damage through Multispectral Satellite Imagery and Improved
UNet++,” Sensors, vol. 22, issue 19, 2022. doi: 10.3390/s22197440.
3. N.N. Kussul, N.S. Lavreniuk, A.Y. Shelestov, B.Y. Yailymov, and I.N. Butko,
“Land Cover Changes Analysis Based on Deep Machine Learning Technique,”
Journal of Automation and Information Sciences, vol. 48, no. 5, pp. 42–54, 2016.
doi: 10.1615/jautomatinfscien.v48.i5.40.
4. T. van Erven, P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,”
IEEE Transactions on Information Theory, 60(7), 2014. Available:
https://doi.org/10.1109/TIT.2014.2320500
5. A. Ilnitskiy, O. Burba, “Statistical criteria for assessing the informativity of the
sources of radio emission of telecommunication networks and systems in their
recognition,” Cybersecurity: Education, Science, Technique, 1(5), pp. 83–94, 2019.
doi: 10.28925/2663-4023.2019.5.8394.
6. Forest type 2018. Accessed on: April 07, 2023. [Online]. Available:
https://land.copernicus.eu/pan-european/highresolution-layers/forests/forest-type-1/status-maps/forest-type-2018.
7. “Spatial Resolutions,” Sentinel Online. Accessed on: August 13, 2023. [Online].
Available: https://sentinels.copernicus.eu/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial
8. “What is a Greedy Approach? - Algorithms for Coding Interviews in Java,”
educative.io. Accessed on: May 08, 2023. [Online]. Available:
https://www.educative.io/courses/algorithms-coding-interviews-java/3j1R50KnNjQ
9. C. Croux, C. Dehon, “Influence functions of the Spearman and Kendall correlation
measures,” Statistical Methods and Applications, vol. 19, pp. 497–515, 2010. doi:
10.1007/s10260-010-0142-z.
10. P.M. Atkinson, A.R. Tatnall, “Introduction neural networks in remote sensing,”
International Journal of Remote Sensing, vol. 18(4), 1997. doi:
10.1080/014311697218700.
11. “Metrics for semantic segmentation,” ilmonteux.github.io. Accessed on: May 27, 2023.
[Online]. Available: https://ilmonteux.github.io/2019/05/10/segmentation-metrics.html
Received 06.09.2023
INFORMATION ON THE ARTICLE
Yevhenii V. Salii, ORCID: 0009-0006-0395-8099, Educational and Research Institute of
Physics and Technology of the National Technical University of Ukraine “Igor Sikorsky
Kyiv Polytechnic Institute”, Ukraine, e-mail: yevhenii.salii@gmail.com
Alla M. Lavreniuk, ORCID: 0000-0002-5791-0377, Educational and Research Institute
of Physics and Technology of the National Technical University of Ukraine “Igor
Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: alla.lavrenyuk@gmail.com
Nataliia M. Kussul, ORCID: 0000-0002-9704-9702, Educational and Research Institute
of Physics and Technology of the National Technical University of Ukraine “Igor
Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: nataliia.kussul@gmail.com
STATISTICAL METHODS OF FEATURE ENGINEERING FOR THE PROBLEM OF FOREST STATE
CLASSIFICATION USING SATELLITE DATA / Y.V. Salii, A.M. Lavreniuk, N.M. Kussul
Abstract. Timely detection of forest diseases is an important task for
preventing their spread and limiting the consequences. The use of satellite
images provides opportunities for large-scale forest monitoring. Machine
learning models make it possible to automate the analysis of these data to
detect anomalies that may indicate the presence of diseases. Selecting
informative features is a key stage in building an effective model. The
possibility of applying the Bhattacharyya distance and the Spearman correlation
coefficient to select features from satellite images was investigated. A greedy
algorithm was applied to form a subset of weakly correlated features. The
experiment showed that the selected features improve classification quality
compared to using all spectral channels. The proposed approach demonstrated its
effectiveness for selecting informative and weakly correlated features and can
be applied in other Earth remote sensing tasks.
Keywords: Sentinel-2, vegetation indices, Bhattacharyya distance, feature
engineering, greedy algorithms, Spearman correlation coefficient.
|
| id | journaliasakpiua-article-286178 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2025-07-17T10:28:20Z |
| publishDate | 2024 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/b1/91525c20d84f95f10869a0d6c1a9ccb1.pdf |
| spelling | journaliasakpiua-article-2861782024-05-23T07:09:36Z Statistical methods of feature engineering for the problem of forest state classification using satellite data Статистичні методи інженерії ознак для задачі класифікації стану лісів за супутниковими даними Salii, Yevhenii Lavreniuk, Alla Kussul, Nataliia Sentinel-2 вегетаційні індекси відстань Бгаттачар’я інженерія ознак жадібні алгоритми коефіцієнт кореляції Спірмена Sentinel-2 vegetation indices Bhattacharyya distance feature engineering greedy algorithms Spearman’s rank correlation coefficient Timely detection of forest diseases is an important task for their prevention and spread limitation. The usage of satellite imagery provides capabilities for large-scale forest monitoring. Machine learning models allow to automate the analysis of these data for anomaly detection indicating diseases. However, selecting informative features is key to building an effective model. In this work, the application of Bhattacharyya distance and Spearman’s rank correlation coefficient for feature selection from satellite images was investigated. A greedy algorithm was applied to form a subset of weakly correlated features. The experiment showed that selected features allow for improving the classification quality compared to using all spectral bands. The proposed approach demonstrates effectiveness for informative and weakly correlated feature selection and can be utilized in other remote sensing tasks. Своєчасне виявлення хвороб лісу є важливим завданням для запобігання їх поширенню та обмеження наслідків. Використання супутникових зображень надає можливості для великомасштабного моніторингу лісів. Моделі машинного навчання дають змогу автоматизувати аналіз цих даних для виявлення аномалій, що можуть свідчити про наявність хвороб. Відбір інформативних ознак є ключовим етапом побудови ефективної моделі. 
Досліджено можливість застосування відстані Бгаттачар’я та коефіцієнта кореляції Спірмена для відбору ознак із супутникових зображень. Застосовано жадібний алгоритм для формування підмножини слабко корельованих ознак. Експеримент показав, що обрані ознаки дозволяють покращити якість класифікації порівняно з використанням усіх спектральних каналів. Запропонований підхід продемонстрував ефективність для відбору інформативних і слабко корельованих ознак та може застосовуватися в інших задачах дистанційного зондування Землі. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2024-03-29 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/286178 10.20535/SRIT.2308-8893.2024.1.07 System research and information technologies; No. 1 (2024); 86-98 Системные исследования и информационные технологии; № 1 (2024); 86-98 Системні дослідження та інформаційні технології; № 1 (2024); 86-98 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/286178/296362 |
| spellingShingle | Sentinel-2 вегетаційні індекси відстань Бгаттачар’я інженерія ознак жадібні алгоритми коефіцієнт кореляції Спірмена Salii, Yevhenii Lavreniuk, Alla Kussul, Nataliia Статистичні методи інженерії ознак для задачі класифікації стану лісів за супутниковими даними |
| title | Статистичні методи інженерії ознак для задачі класифікації стану лісів за супутниковими даними |
| title_alt | Statistical methods of feature engineering for the problem of forest state classification using satellite data |
| topic | Sentinel-2 вегетаційні індекси відстань Бгаттачар’я інженерія ознак жадібні алгоритми коефіцієнт кореляції Спірмена |
| topic_facet | Sentinel-2 вегетаційні індекси відстань Бгаттачар’я інженерія ознак жадібні алгоритми коефіцієнт кореляції Спірмена Sentinel-2 vegetation indices Bhattacharyya distance feature engineering greedy algorithms Spearman’s rank correlation coefficient |
| url | https://journal.iasa.kpi.ua/article/view/286178 |