Overview of neural network object detection methods and models on the example of their use for lab animal observation
This article provides a brief overview of a set of the most common basic object detection neural network models. Today, the need for automating surveillance and observation processes remains a growing trend. Moreover, one of the key tasks of such processes is usually the detection of an object of in...
| Date: | 2025 |
|---|---|
| Authors: | Shvandt, Maksym; Moroz, Volodymyr |
| Format: | Article |
| Language: | English |
| Published: | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", 2025 |
| Online access: | https://journal.iasa.kpi.ua/article/view/351422 |
| Journal title: | System research and information technologies |

Repositories: System research and information technologies

| _version_ | 1866303056674553856 |
|---|---|
| author | Shvandt, Maksym Moroz, Volodymyr |
| author_facet | Shvandt, Maksym Moroz, Volodymyr |
| author_sort | Shvandt, Maksym |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2026-02-02T20:49:24Z |
| description | This article provides a brief overview of a set of the most common basic object detection neural network models. Today, the need for automating surveillance and observation processes remains a growing trend. Moreover, one of the key tasks of such processes is usually the detection of an object of interest for further analysis. Previously, many basic object detection algorithms and approaches have been proposed; however, most of them typically have limitations in terms of their applicability. In most cases, these limitations arise due to the nature of the observed environment or because the detection approaches rely on specific object characteristics, such as color or basic shapes only. To address these problems, a new approach for object detection has been developed using neural networks. This paper presents the basis and central aspects of the most common neural network object detection models. The experiment has demonstrated the features, advantages, and disadvantages of the studied methods in the application case of lab animal detection during their behavioral study. Considering this, conclusions and recommendations for their usage cases were made. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2025.4.05 |
| first_indexed | 2026-02-08T08:06:11Z |
| format | Article |
| fulltext |
M.A. Shvandt, V.V. Moroz, 2025
System Research & Information Technologies, 2025, № 4, p. 71
METHODS, MODELS, AND TECHNOLOGIES OF ARTIFICIAL INTELLIGENCE IN SYSTEM ANALYSIS AND CONTROL
UDC 004.932:519.652
DOI: 10.20535/SRIT.2308-8893.2025.4.05
OVERVIEW OF NEURAL NETWORK OBJECT DETECTION
METHODS & MODELS ON THE EXAMPLE OF THEIR USE
FOR LAB ANIMAL OBSERVATION
M.A. SHVANDT, V.V. MOROZ
Abstract. This article provides a brief overview of the most common basic neural network models for object detection. Today, the need to automate surveillance and observation processes continues to grow. One of the key tasks of such processes is usually the detection of an object of interest for further analysis. Many basic object detection algorithms and approaches have been proposed previously; however, most of them have limited applicability. In most cases, these limitations arise from the nature of the observed environment or because the detection approaches rely only on specific object characteristics, such as color or basic shapes. To address these problems, approaches to object detection based on neural networks have been developed. This paper presents the basis and central aspects of the most common neural network object detection models. An experiment demonstrated the features, advantages, and disadvantages of the studied methods when applied to the detection of lab animals during a behavioral study. Based on this, conclusions and recommendations for their use are made.
Keywords: object detection, neural network, neural layer, architecture, model, optimization, estimation, prediction, video, image, frame, background, foreground, experiment, comparison.
INTRODUCTION
Since the 1990s, the rapid advancement of computers, together with the strong development of computer science, has led to the wide automation of many everyday processes and procedures. Since then, visual analysis [1] has become one of the most widely used technologies; it is applied everywhere from pedestrian and traffic control to military operations and factory production. Object detection and tracking usually play important roles in visual analysis. A typical task is to detect an object of interest and then track it from frame to frame, either in prerecorded video or directly in a live stream, in order to analyze the object or its behavior.
Object detection and object tracking are generally two separate tasks, each requiring its own approaches and methods. While some common basic object detection and tracking methods were considered in previous articles [2; 3], this time we take a look at a more complex way of performing object detection itself. As already mentioned, in many cases it is necessary to detect a specific type of object that has both specific colors and a specific shape or structure. Searching for it with approaches such as template matching [2] is not a good option, because template matching does not cope well with changes in object scale and rotation: it requires comparing multiple scaled and rotated versions of the template against the image in order to find the best matches. Such a search is inefficient in terms of per-frame performance and is unlikely to be usable on high-resolution live video streams. Also, if the object changes its shape even slightly from frame to frame, the comparison with the template is affected, and more templates will probably be needed to match each "new" shape. This approach also works mainly with object shape, so color checking remains a second problem to solve. A neural detector can come in handy in such cases, as the model coefficients can be trained to recognize multiple shapes and colors of a particular object. In general, the main difficulty here is providing a good training dataset: it should contain images in which the desired type of object is clearly visible and cannot be confused with the background.
One of the many processes that can require automated surveillance and observation is biological research. It often involves studying the life processes of multiple lab animals, for example mice and fish [2; 3]. Usually the animals are put in specific conditions so that they are easier to observe and their behavior is easier to record. Doing this manually is time-consuming, and automating it saves lab personnel time. The particular case considered here is the observation of goby behavior (Fig. 1). The gobies are kept in a square aquarium with a camera placed directly above it, recording all their movements during the day in order to understand the aspects of their activity. A complex program for tracking the fish and studying their behavior is currently being developed, and it requires a neural-network-based object detector to improve fish position localization. At this stage, in order to choose the best-performing detector from a set of the most common openly available models, an experiment was carried out to learn which model is best suited as such an object finder, so that it can later be integrated into the main detection and tracking algorithm. The experiment showed the algorithmic aspects, advantages, and disadvantages of the models when applied in test conditions like ours. This analysis, presented further in this paper, might be useful for anyone who plans to use these models in similar conditions.
THE PROBLEM OF NEURAL NETWORK OBJECT DETECTION
The selected models were chosen according to three main criteria. The first is the hardware requirements in terms of performance. The practical use of many computational algorithms is often limited by the hardware they run on. As the fish video analysis is intended to be performed on-site, the resulting detection and tracking algorithm should deliver adequate performance on ordinary mass-market hardware rather than on cloud servers or large mainframes.

Fig. 1. Lab fish (gobies) in the study environment (a, b)
The second criterion is the possibility of training the model on local mass-market hardware as well. During the experiment (described later in this article) it was found that some model versions could not be trained locally with a minimally sufficient image batch size. With regard to acceptable model performance, it was decided that a batch size of just 1 or 2 images could lead to poor model training.
The third criterion is the model's ratio of detection speed to accuracy. The models' mAP was evaluated previously [4] with the COCO evaluator [5; 6]. While accuracy is an important feature, detection speed is also very significant. Each frame is additionally preprocessed to remove noise and enhance color characteristics, which takes additional time, so running detection on a single frame should stay within reasonable time limits: overall video processing should not take several times longer than the video itself. Considering this, we took the model versions with the highest claimed accuracy that did not exceed a detection time threshold of about 100–110 ms.
An additional difficulty of this experiment is that the objects of interest are gobies, which, when filmed as described above, do not have many visually significant marks or features compared to other objects such as cars, larger animals, or buildings. The lack of visually distinguishable features makes both network training and usage more challenging.
CenterNet architecture. The first model architecture considered is CenterNet. To estimate the bounding box of a searched object and to classify it, anchor-free object detection uses two approaches: the keypoint-based approach and the center-based approach. In the keypoint-based approach, the network predicts predefined key points, which are then used to generate a bounding box around the object and to classify it. Examples of such architectures are CenterNet: Keypoint Triplets [7], CornerNet [8], and Grid-RCNN [9]. The center-based approach [7; 10; 11] uses the center point or any part point of an object to define positive and negative samples. It then predicts the distances from these positives to the four coordinates used to generate a bounding box. For example, methods such as DenseBox [12] and FCOS [13] generate positive samples and use them to estimate boxes and class probabilities.
As for CenterNet, the main paper [10] treats the center of a box both as the object and as a key point. It then uses this predicted center to find the coordinates/offsets of the bounding box. The center prediction task is thus treated as a standard keypoint estimation problem. When an image is passed through a fully convolutional network, the final feature map outputs heatmaps for the different key points. The peaks of these output feature maps are considered the predicted centers. The network also predicts the width and height of the box for these centers, with each center having its own box width and height. This binding makes it possible to remove the non-maximum suppression step in post-processing. Each heatmap peak is also linked to the particular class to which it belongs, which enables object classification. Using these centers, dimensions, and class probabilities, the object detection task is accomplished.
In general, the CenterNet architecture works in the following way. The input image I has width W, height H, and 3 RGB channels. R is the output stride, which sets the resulting dimensions of the heads. All heads have the same width W/R and height H/R, but different depths C. Thus the final head dimensions are (W/R, H/R, C), where C is the number of classes for the heatmap head and 2 for each of the other two heads. For example, if the input dimensions are 512×512, the head dimensions are 128×128 for a stride R = 4 (Fig. 2, a). The three heads shown in Fig. 2, a are the Heatmap Head, the Dimension Head, and the Offset Head.
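The head sizes described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `head_shapes` is hypothetical, and the depth of 2 for the Dimension and Offset heads is taken from the head descriptions given later in the text.

```python
def head_shapes(width, height, stride, num_classes):
    """Return the (W/R, H/R, C) shape of each CenterNet head."""
    # Every head shares the same downsampled spatial size.
    w, h = width // stride, height // stride
    return {
        "heatmap": (w, h, num_classes),  # one channel per class
        "dimension": (w, h, 2),          # box width and height
        "offset": (w, h, 2),             # x and y sub-pixel offsets
    }


shapes = head_shapes(512, 512, stride=4, num_classes=3)
print(shapes["heatmap"])  # (128, 128, 3)
```

Only the heatmap head depends on the number of classes, which is why adding classes does not enlarge the other two heads.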
Heatmap Head is used to estimate the key points of the given input image. In the case of object detection, the key points are the box centers. One has to predict a heatmap $\hat{Y}$ of dimensions (W/R, H/R, C), where R is the output stride and C is the number of classes; $\hat{Y}$ is a function of x, y, c. A prediction $\hat{Y}_{xyc} = 1$ corresponds to a detected center for the particular class c, while $\hat{Y}_{xyc} = 0$ is considered background. To compute the ground-truth heatmaps for loss propagation, the centers are splatted using Gaussian kernels after being converted to their low-resolution equivalent (division by the stride R, denoted $\tilde{p}$). For example, for three classes (C = 3) and an input image of 400×400 with stride R = 4, it is necessary to generate 3 heatmaps (one per class) of dimension 100×100, as shown in Fig. 2, b. The value $\sigma_p$ used in the kernel is the object-size-adaptive standard deviation. Also, if two Gaussians of the same class overlap, their element-wise maximum is taken as the target value.
Fig. 2. (a) Three heads are predicted after one forward pass through the network architecture: Offset Head, Heatmap Head, Dimension Head. Here "Some Architecture (FCN)" refers to any of the feature extractors one may want to use (the specified heads are for object detection); (b) Left: ground truths of different classes, shown in different colors; Right: three centers of the respective classes splatted into heatmaps using a Gaussian kernel (source: medium.com/visionwizard [11])
$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right).$$
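The Gaussian splatting of ground-truth centers can be sketched as follows. This is a minimal pure-Python illustration rather than the authors' implementation: the function name `gaussian_heatmap`, the example centers, and the fixed `sigma` are assumptions (CenterNet derives the standard deviation from the object size).

```python
import math


def gaussian_heatmap(width, height, centers, sigma):
    """Splat low-resolution centers of ONE class into a heatmap.

    centers: list of (px, py) integer coordinates on the feature map.
    sigma:   standard deviation; fixed here, size-adaptive in CenterNet.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in centers:
        for y in range(height):
            for x in range(width):
                value = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                # Overlapping Gaussians of the same class: keep the maximum.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


stride = 4
# Hypothetical object centers in a 400x400 input image.
centers_hi_res = [(120, 200), (136, 200)]
centers_low_res = [(cx // stride, cy // stride) for cx, cy in centers_hi_res]
hm = gaussian_heatmap(100, 100, centers_low_res, sigma=2.0)
print(hm[50][30], hm[50][34])  # 1.0 1.0
```

Between the two peaks the Gaussians overlap, and the element-wise maximum keeps the target values in the range [0, 1] instead of summing them.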
Dimension Head is used to estimate the dimensions of the boxes, that is, their width and height. Given the box coordinates $(x_1, y_1, x_2, y_2)$ of object k of class c, one can regress the object size $s_k = (x_2 - x_1, y_2 - y_1)$. This is achieved by minimizing a standard $L_1$ distance norm. The dimensions of this head are (W/R, H/R, 2), where the two channels w and h are the predicted width and height of the box. To reduce the amount of computation, a single size prediction is used for all object categories.
Offset Head is used to recover from the discretization error caused by downsampling the input. After the center points are predicted, their coordinates have to be mapped back to the higher-resolution input image. Since the pixel indices of the original image are integers while the mapped values are floating-point, this introduces a discretization error. To solve this issue, the network predicts local offsets $\hat{O}$; these local offset values are shared between the objects in an image. The Offset Head dimensions are (W/R, H/R, 2) (the two channels are the x and y coordinate offsets). The overall detection flow can be seen in Fig. 3, a, b.
As a feature extractor, CenterNet can use a variety of backbones [8; 10; 11]. In our research we considered the following ones: the Stacked Hourglass Network [14] (Hourglass104 version), the Residual Network [15] (ResNet101 V1 FPN), and MobileNet [16] (MobileNet V2/V1 FPN). For instance, the Stacked Hourglass Network downsamples the input by 4×, followed by two sequential hourglass modules, each of which is a symmetric 5-layer down- and up-convolutional network with skip connections. The original paper [10] also used modified (Fig. 3, c) ResNet18, ResNet101, and Deep Layer Aggregation (DLA) networks [17] with added deconvolutional and deformable convolutional layers. Standard ResNet modules were extended with three transposed convolutional layers to produce higher-resolution outputs. To reduce computation, the numbers of filters in the upsampling layers were reduced to 256, 128, and 64, respectively. The authors also added a 3×3 deformable convolutional layer between the upsampling layers, which led to better results on some standard datasets [10; 11].
The main part of the CenterNet algorithm is loss calculation and propagation. After the network generates the heatmaps, the loss has to be propagated in a way that keeps training stable. In the original paper [8], the authors use several loss functions to balance the training of the different heads. Three loss functions are used: the Heatmap Variant Focal Loss, the L1 Norm Offset Loss, and the L1 Norm Dimension Size Loss. For the Heatmap Variant Focal Loss, the focal loss function [18; 19] is divided into two parts, for positive and negative samples:
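$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}(1 - \hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1,\\ (1 - Y_{xyc})^{\beta}(\hat{Y}_{xyc})^{\alpha}\log(1 - \hat{Y}_{xyc}), & \text{otherwise.}\end{cases}$$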
If $Y = 1$ and the predicted $\hat{Y}$ is close to 1 (e.g. $\hat{Y} = 0.95$), the sample is considered an easy (well-classified) example, and by the logic of the focal loss the weight of the propagated loss is decreased. The same logic applies to hard (misclassified) examples, with the difference that instead of being decreased, the weight is increased through the exponent $\alpha$ (here $\alpha = 2$). If $Y \neq 1$ (the "otherwise" branch) and the predicted $\hat{Y}$ is very close to 0 (e.g. $\hat{Y} = 0.005$), then the factor $(\hat{Y})^{\alpha}$ drives the loss toward zero, so less weight is assigned to the propagated loss, as stated in the premise of the focal loss [18]. A particular case arises when $\hat{Y}$ is not close to 0 but near 1, and lies in the neighborhood of a ground-truth heatmap peak. As the ground truths are Gaussian kernel outputs, there is no sudden drop in the values near $Y = 1$, and the loss treats values lying inside the Gaussian as possible positives. This is an advantage of this loss. For example, let $\hat{Y} = 0.9$ at a point near the center peak of the ground truth. Formally this is a misclassification, as the value should be very close to 0 according to plain logistic regression loss logic. But since $Y$ is close to 1 in the region near the center peak, the propagated loss is down-weighted by the term $(1 - Y)^{\beta}$ even though the point is misclassified. If instead $\hat{Y} = 0.9$ at a point far from the center peak, a large loss is propagated, because the point does not lie in the splatted region, $Y$ is very close to 0, and the term $(1 - Y)^{\beta}$ is close to 1. Here $\beta = 4$. The design of this loss function increases the number of positive examples by taking into account the heatmap values generated by the Gaussian kernels, which in turn helps to decrease the imbalance between positives and negatives.

Fig. 3. a, b: training; c: inference flowchart of the explained object detection algorithm (here: (a) Stacked Hourglass Network, (b) ResNets with transposed and deformable convolution layers, (c) original DLA-34, (d) modified DLA-34 with added skip connections and deformable convolutional layers) (source: medium.com/visionwizard [11])
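The behavior of the heatmap focal loss described above can be reproduced numerically. The following is a minimal sketch on flat lists of values, not a training-ready implementation; the function name `heatmap_focal_loss`, the clipping constant `eps`, and the choice to normalize by at least 1 when no positive is present are assumptions made for illustration.

```python
import math


def heatmap_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced focal loss over flat lists of heatmap values."""
    total = 0.0
    num_pos = 0
    for y_hat, y in zip(pred, target):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite
        if y == 1.0:
            num_pos += 1
            total += (1 - y_hat) ** alpha * math.log(y_hat)
        else:
            total += (1 - y) ** beta * y_hat ** alpha * math.log(1 - y_hat)
    # Normalize by the number of key points (at least 1, an assumption
    # made here to avoid division by zero).
    return -total / max(num_pos, 1)


# Same wrong prediction (0.9) near a peak (Y = 0.8) and far from it (Y = 0.0).
near = heatmap_focal_loss([0.9], [0.8])
far = heatmap_focal_loss([0.9], [0.0])
print(near < far)  # True
```

The same prediction of 0.9 is penalized far less inside the Gaussian neighborhood of a center than in the background, which is the effect of the $(1 - Y)^{\beta}$ term.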
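The L1 Norm Offset Loss is a simple L1 norm of the difference between the predicted offset $\hat{O}$ and the ground-truth offset values:

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right|.$$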
The meaning of the ground-truth offset values can be seen from an example: if there is a center point at (18, 22) in the original high-resolution image, then after downsampling with a stride of 4 the exact coordinates are (4.5, 5.5), which map to (4, 5) on the low-resolution feature map. This gives an offset error of 0.5 in both coordinates. In keypoint estimation it is important to handle this problem, as keypoints are very position-sensitive. The offset loss function is added to obtain more accurate results. This supervision acts only at the key point positions; all other locations are ignored.
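The example with the center point (18, 22) can be verified directly. This is a small sketch with hypothetical helper names (`center_to_low_res`, `low_res_to_center`); it shows how the ground-truth offset is the fractional part lost by the integer division and how adding it back recovers the original position.

```python
def center_to_low_res(center, stride):
    """Map a high-resolution center to its feature-map cell and offset."""
    cell = tuple(c // stride for c in center)  # integer cell index
    # Fractional part that the integer division discards.
    offset = tuple(c / stride - i for c, i in zip(center, cell))
    return cell, offset


def low_res_to_center(cell, offset, stride):
    """Invert the mapping: recover the high-resolution center."""
    return tuple((i + o) * stride for i, o in zip(cell, offset))


cell, offset = center_to_low_res((18, 22), stride=4)
print(cell, offset)                        # (4, 5) (0.5, 0.5)
print(low_res_to_center(cell, offset, 4))  # (18.0, 22.0)
```

Without the offset, mapping the cell (4, 5) back with stride 4 would give (16, 20), an error of 2 pixels in each coordinate.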
The L1 Norm Dimension Size Loss between the predicted and ground-truth width and height is used to regress the width and height of the bounding boxes. Here $\hat{S}$ are the predicted dimensions and $s$ are the actual ground-truth sizes. Raw pixel values are used to calculate the loss instead of values normalized by the feature map size:

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|.$$
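The total loss propagated by the network is a weighted sum of the three losses above, with $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$. The total loss of CenterNet:

$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}.$$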
An example of the object detection process is visualized in Fig. 4. During inference, the peaks of each heatmap are found as the points whose value is greater than or equal to the values of their 8 neighbors, and the top 100 peaks are kept, for each class independently. This is achieved by a 3×3 max-pooling operation on the resulting feature map; the obtained peak coordinates are then used to read out the dimension and offset predictions.

Fig. 4. Left: keypoint heatmap; middle: keypoint offsets; right: dimensions of box (source: original paper [10])
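The peak extraction step can be illustrated on a toy heatmap. This is a minimal pure-Python sketch of the 3×3 local-maximum test for a single class; the function name `find_peaks` and the toy values are hypothetical, and a real implementation would use a max-pooling layer on the whole tensor instead of explicit loops.

```python
def find_peaks(heatmap, top_k=100):
    """Return up to top_k (score, x, y) local maxima of a single-class heatmap."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            # 3x3 window around (x, y), clipped at the borders.
            window = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # A point is a peak if it equals the maximum of its 3x3 window.
            if heatmap[y][x] > 0 and heatmap[y][x] == max(window):
                peaks.append((heatmap[y][x], x, y))
    peaks.sort(reverse=True)  # highest scores first
    return peaks[:top_k]


toy = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
print(find_peaks(toy))  # [(0.9, 1, 1), (0.6, 3, 2)]
```

Because only local maxima survive, neighboring responses to the same object are suppressed without a separate non-maximum suppression pass over boxes.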
The Residual Network. As mentioned earlier, CenterNet can use several different backbone architectures for feature extraction. We considered models using the Residual Network, MobileNet, and the Stacked Hourglass Network. The first backbone used in the studied models is the Residual Network architecture [15; 20; 21]. It is based on the concept of residual blocks, which were designed to address the vanishing/exploding gradient problem. ResNet applies a technique known as skip connections: the activations of a layer are passed directly to a later layer, bypassing the layers in between, which forms a residual block (Fig. 5, a). These residual blocks are stacked to create residual networks (Fig. 6). The main idea of this architecture is to let the stacked layers fit a residual mapping $F(x)$ instead of directly learning the underlying mapping $H(x)$:

$$F(x) := H(x) - x, \qquad H(x) := F(x) + x.$$
Thanks to this skip connection, regularization can skip any layer that worsens the performance of the architecture, and training a very deep neural network becomes possible without running into vanishing or exploding gradients.
The general purpose of ResNet is the following. In deep neural networks, extra layers are stacked in order to improve accuracy and performance, often to handle a challenging problem. The idea behind adding layers is that they will progressively learn more complicated features. Take image recognition as an example: the first layer may pick up edges, the second textures, the third objects, and so on. However, the traditional convolutional neural network model was found to have a maximum depth threshold. The graph in Fig. 5, b shows the error percentage on training and test data for a 20-layer network and a 56-layer network, respectively.
In both training and testing, the 56-layer network has a higher error percentage than the 20-layer network. This demonstrates that adding more layers on top of a network can degrade its performance. Possible causes are the network initialization, the optimization function, and, most significantly, the vanishing gradient problem. Overfitting is not the issue here: the 56-layer network's error is worse on both training and test data, which would not happen if the model were overfitting.
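The residual mapping H(x) = F(x) + x discussed above can be sketched with plain lists. This is an illustrative toy, assuming a two-layer fully connected F(x) with hypothetical weight matrices `w1` and `w2`; real ResNet blocks use convolutions and batch normalization.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(vector, weights):
    """Multiply a vector by a square weight matrix (list of rows)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute H(x) = F(x) + x with F(x) = W2 * relu(W1 * x)."""
    fx = linear(relu(linear(x, w1)), w2)           # residual mapping F(x)
    return relu([f + xi for f, xi in zip(fx, x)])  # add the shortcut, then ReLU


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights F(x) = 0, so for non-negative inputs the block
# reduces to the identity.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0]
```

This is why extra residual blocks are cheap to add in principle: a block that learns nothing useful can fall back to passing its input through unchanged.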
Fig. 5. (a) Skip connection or shortcut; (b) comparison of 20-layer vs 56-layer architecture (source: medium.com/siddheshb008 [20], original paper [15])
The ResNet architecture exists in several configurations. For example, it can use a VGG-19-inspired 34-layer plain network architecture to which shortcut connections are added; these shortcut connections turn it into a residual network, as depicted in Fig. 6. Keras, an open-source Python-based neural network framework, offers ResNet V1 and ResNet V2 with 50, 101, or 152 layers: ResNet50, ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2. ResNetV2 and the original ResNet (V1) differ primarily in that V2 applies batch normalization before each weight layer. In brief, ResNet is a robust backbone model that can be used in various computer vision tasks.
Feature Pyramid Network (FPN). Detecting objects at different scales is a very difficult task, especially for small objects. A pyramid of the same image at different scales can be used to detect objects (Fig. 7, a), but processing images at multiple scales is too demanding in terms of time and memory to be trained end-to-end. That is why it can be used only at inference, to push accuracy as high as possible in cases where speed is not a concern. As an alternative, a pyramid of feature maps can be created and used for object detection (Fig. 7, b), but the feature maps closer to the image layer are composed of low-level structures that are not suitable for accurate object detection. The Feature Pyramid Network is a feature extractor [22–25] designed for the described pyramid approach with both speed and accuracy in mind. It replaces the feature extractor of detectors such as Faster R-CNN [26] and generates multi-scale feature maps (multiple feature map layers) that carry better-quality information than the regular feature pyramid for object detection.
Fig. 6. ResNet34 architecture (source: medium.com/siddheshb008 [20], original paper [15])
Fig. 7. (a) Pyramid of images; (b) pyramid of feature maps (source: medium.com/jonathan-hui [23], original paper [22])
The FPN data flow is composed of a bottom-up and a top-down pathway (Fig. 8, a). The bottom-up pathway is an ordinary convolutional network used for feature extraction: as one goes up, the spatial resolution decreases, while the semantic value of each layer increases as more high-level structures are detected (Fig. 8, b). The Single Shot MultiBox Detector (SSD) [27] makes detections from multiple feature maps, but it does not use the bottom layers for object detection (Fig. 8, c). Although these layers have high resolution, their semantic value is not high enough to justify the significant slow-down of using them. Because of that, SSD uses only the upper layers for detection, and its performance is therefore much worse for small objects. FPN uses a top-down pathway to construct higher-resolution layers from a semantically rich layer (Fig. 8, d).
The reconstructed layers are semantically strong, but after all the downsampling and upsampling operations the objects are no longer located precisely. To improve object location prediction, lateral connections are added between the reconstructed layers and the corresponding feature maps. They also simplify training, as they act as skip connections; a similar approach is used in ResNet [15]. ResNet is used for the bottom-up stage. The bottom-up pathway consists of convolution modules $Conv_i$, $i \in [1, 5]$, each composed of multiple convolution layers. At each module the spatial dimension is reduced by 1/2 (i.e. the stride is doubled), and the output of each convolution module is labeled $C_i$ and later used in the top-down pathway (Fig. 9, a). In the top-down pathway, a 1×1 convolution filter is applied to reduce the channel depth of $C_5$ to 256-d and obtain $M_5$. This yields the first feature map layer used for object prediction. With each step further down, the previous layer is upsampled by a factor of 2 using nearest-neighbor upsampling. A 1×1 convolution is again applied to the corresponding feature map of the bottom-up pathway, and the two are added element-wise. A 3×3 convolution is then applied to all merged layers to reduce the aliasing effect of merging with the upsampled layer (Fig. 9, b). The same process is repeated for the pyramid feature maps $P_3$, $P_2$, but it stops at $P_2$ because the spatial dimension of $C_1$ is too large; continuing would slow the process down too much. As the same classifier and box regressor are shared by all output feature maps, all pyramid feature maps $P_5, P_4, P_3, P_2$ have 256-d output channels.

Fig. 8. (a) FPN data flow; (b) feature extraction in FPN; (c) SSD object detection with top levels; (d) FPN top-down pathway (source: medium.com/jonathan-hui [23]; original paper [22])
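The upsample-and-add step of the top-down pathway can be sketched on tiny single-channel maps. This is a simplified illustration under stated assumptions: the 1×1 lateral convolutions and the 3×3 smoothing convolutions are omitted, each map has one channel, and the function names (`upsample_nearest_2x`, `merge_top_down`, `top_down_pathway`) are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbor upsampling of a 2D map by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge_top_down(coarse, lateral):
    """Element-wise sum of the upsampled coarse map and the lateral map."""
    up = upsample_nearest_2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def top_down_pathway(laterals):
    """laterals: maps ordered fine to coarse (assumed already 1x1-convolved).

    Returns the merged maps in the same order.
    """
    merged = [laterals[-1]]  # the coarsest map starts the pathway
    for lateral in reversed(laterals[:-1]):
        merged.insert(0, merge_top_down(merged[0], lateral))
    return merged


c5 = [[1.0]]
c4 = [[0.0, 1.0], [2.0, 3.0]]
m4, m5 = top_down_pathway([c4, c5])
print(m4)  # [[1.0, 2.0], [3.0, 4.0]]
```

Each merged map combines the coarse, semantically strong signal (the upsampled value 1.0) with the finer spatial detail of the lateral map.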
FPN is not an object detector by itself. It is a feature extractor that works together with object detectors: it extracts feature maps, which are then fed into a detector, for example a Region Proposal Network (RPN). The RPN applies a sliding window over those feature maps to predict the objectness (i.e. whether there is an object or not) and the object bounding box at each location (Fig. 9, c). In the FPN framework, for each scale level (for example $P_4$ or $P_3$), a 3×3 convolution filter is applied over the feature maps, followed by separate 1×1 convolutions for objectness prediction and bounding box regression. These 3×3 and 1×1 convolutional layers are called the RPN head (Fig. 9, d).
MobileNet architecture. The third considered architecture is MobileNet.
This architecture was developed by Google in 2017 [16; 28]. It utilizes the
approach called Depthwise Separable Convolution in order to reduce the model
size and complexity. This architecture was primarily created for use in mobile and
embedded vision applications (Fig. 10, a). It has following benefits: smaller model
size (fewer number of parameters) and smaller complexity (fewer multiplications and
additions, aka Multi-Adds). To make MobileNet easy to tune, two parameters were
introduced: Width Multiplier and Resolution Multiplier .
The Depthwise separable convolution is a depthwise convolution that is fol-
lowed by a pointwise convolution (Fig. 10, b); the Depthwise convolution is the
channel-wise KK DD spatial convolution. For example, if one has 5 channels,
then there are 5 KK DD spatial convolutions. The Pointwise convolution
actually is the 1×1 convolution intended to change the dimension. Combined with
the Depthwise Convolution, the operation cost is:
Fig. 9. (a) ResNet for FPN bottom-up and top-down pathways; (b) feature merging operation during top-down pathway; (c) FPN usage with RPN; (d) RPN head (source: medium.com/jonathan-hui [23])
M.A. Shvandt, V.V. Moroz
ISSN 1681–6048 System Research & Information Technologies, 2025, № 4 82
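$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F,$$
where the left term of the sum is the Depthwise Convolution Cost and the right one is the Pointwise Convolution Cost. Here $M$ is the number of input channels, $N$ is the number of output channels, $D_K$ is the kernel size, and $D_F$ is the feature map size. For Standard Convolution, the cost is
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F.$$
Thus, the ratio of the Depthwise Separable Convolution Cost to the Standard Convolution Cost is:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}.$$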
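The cost ratio can be checked numerically. The following is a minimal sketch; the layer dimensions (3×3 kernel, 512 input and output channels, 14×14 feature map) are chosen only for illustration, and the function names are hypothetical.

```python
def depthwise_separable_cost(d_k, m, n, d_f):
    """Multiply-adds of a depthwise separable convolution."""
    depthwise = d_k * d_k * m * d_f * d_f   # one D_K x D_K filter per input channel
    pointwise = m * n * d_f * d_f           # 1x1 convolution mixing M channels into N
    return depthwise + pointwise


def standard_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution with the same shapes."""
    return d_k * d_k * m * n * d_f * d_f


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
d_k, m, n, d_f = 3, 512, 512, 14
ratio = depthwise_separable_cost(d_k, m, n, d_f) / standard_cost(d_k, m, n, d_f)
closed_form = 1 / n + 1 / d_k ** 2
print(f"ratio = {ratio:.4f}, closed form = {closed_form:.4f}, saving = {1 / ratio:.2f}x")
```

For this layer the saving falls between 8 and 9 times, since the $1/N$ term is small compared with $1/D_K^2 = 1/9$.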
When $D_K \times D_K$ is 3×3, the amount of computation is reduced by a factor of 8 to 9, with only a small reduction in accuracy. Table 1 shows the architecture of MobileNet; Batch Normalization (BN) and ReLU are applied after each convolution (Fig. 11). The Width Multiplier $\alpha$ is introduced to control the number of channels (channel depth), so that $M$ becomes $\alpha M$. Thus, the Depthwise Separable Convolution cost (with Width Multiplier $\alpha$) is:
$$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F,$$
Fig. 10. (a) MobileNets usage in practice; (b) Depthwise separable convolution (source: towardsdatascience.com; original paper [16; 28])
Fig. 11. Left: Standard Convolution; right: Depthwise separable convolution, with BN and ReLU (source: original paper [16])
Overview of neural network object detection methods & modeles on the example of …
Системні дослідження та інформаційні технології, 2025, № 4 83
where $\alpha \in (0, 1]$, with typical settings of 1, 0.75, 0.5 and 0.25. With $\alpha = 1$ it is the baseline MobileNet, and for $\alpha < 1$ the computational cost and the number of parameters are reduced roughly quadratically, by about $\alpha^2$. The Resolution Multiplier $\rho$ is introduced to control the input image resolution of the network, and thus the Depthwise Separable Convolution Cost with both the Width Multiplier $\alpha$ and the Resolution Multiplier $\rho$ is:
$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F,$$
with $\rho \in (0, 1]$ and input resolutions of 224, 192, 160, and 128. With $\rho = 1$, it is the baseline MobileNet.
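The effect of the two multipliers can be illustrated with a short sketch. The function name and the example layer (3×3 kernel, 512 channels, 14×14 feature map) are assumptions made for illustration only.

```python
def mobilenet_layer_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Depthwise separable cost with width multiplier alpha and resolution multiplier rho."""
    m_a, n_a = alpha * m, alpha * n      # thinner layer: fewer input and output channels
    d_f_r = rho * d_f                    # smaller feature map
    depthwise = d_k * d_k * m_a * d_f_r * d_f_r
    pointwise = m_a * n_a * d_f_r * d_f_r
    return depthwise + pointwise


base = mobilenet_layer_cost(3, 512, 512, 14)
for alpha in (1.0, 0.75, 0.5, 0.25):
    relative = mobilenet_layer_cost(3, 512, 512, 14, alpha=alpha) / base
    print(f"alpha = {alpha}: relative cost {relative:.3f} (alpha^2 = {alpha ** 2:.3f})")

half_resolution = mobilenet_layer_cost(3, 512, 512, 14, rho=0.5) / base
print(f"rho = 0.5: relative cost {half_resolution:.3f}")
```

The reduction from $\alpha$ is only approximately quadratic because the depthwise term scales linearly with $\alpha$, whereas the reduction from $\rho$ is exactly $\rho^2$.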
Table 1. MobileNet Body Architecture (source: original paper [16])
| Type / Stride | Filter Shape | Input Size |
|---|---|---|
| Conv / s2 | 3 × 3 × 3 × 32 | 224 × 224 × 3 |
| Conv dw / s1 | 3 × 3 × 32 dw | 112 × 112 × 32 |
| Conv / s1 | 1 × 1 × 32 × 64 | 112 × 112 × 32 |
| Conv dw / s2 | 3 × 3 × 64 dw | 112 × 112 × 64 |
| Conv / s1 | 1 × 1 × 64 × 128 | 56 × 56 × 64 |
| Conv dw / s1 | 3 × 3 × 128 dw | 56 × 56 × 128 |
| Conv / s1 | 1 × 1 × 128 × 128 | 56 × 56 × 128 |
| Conv dw / s2 | 3 × 3 × 128 dw | 56 × 56 × 128 |
| Conv / s1 | 1 × 1 × 128 × 256 | 28 × 28 × 128 |
| Conv dw / s1 | 3 × 3 × 256 dw | 28 × 28 × 256 |
| Conv / s1 | 1 × 1 × 256 × 256 | 28 × 28 × 256 |
| Conv dw / s2 | 3 × 3 × 256 dw | 28 × 28 × 256 |
| Conv / s1 | 1 × 1 × 256 × 512 | 14 × 14 × 256 |
| Conv dw / s1 (pair repeated 5×) | 3 × 3 × 512 dw | 14 × 14 × 512 |
| Conv / s1 (pair repeated 5×) | 1 × 1 × 512 × 512 | 14 × 14 × 512 |
| Conv dw / s2 | 3 × 3 × 512 dw | 14 × 14 × 512 |
| Conv / s1 | 1 × 1 × 512 × 1024 | 7 × 7 × 512 |
| Conv dw / s2 | 3 × 3 × 1024 dw | 7 × 7 × 1024 |
| Conv / s1 | 1 × 1 × 1024 × 1024 | 7 × 7 × 1024 |
| Avg Pool / s1 | Pool 7 × 7 | 7 × 7 × 1024 |
| FC / s1 | 1024 × 1000 | 1 × 1 × 1024 |
| Softmax / s1 | Classifier | 1 × 1 × 1000 |
Later a modified MobileNet version, called V2, was introduced [29; 30]. It uses an inverted residual structure in which the non-linearities in the narrow layers are removed. The difference between V1 and V2 can be briefly described as follows. The MobileNet V1 block has 2 layers: the first, called depthwise convolution, performs lightweight filtering by applying a single convolutional filter per input channel; the second, called pointwise convolution, is a 1×1 convolution that builds new features by computing linear combinations of the input channels.
MobileNet V2 has two types of blocks: a residual block with a stride of 1 and a block with a stride of 2 used for downsizing. Both types have 3 layers. The first layer is a 1×1 convolution with ReLU6, the second is the depthwise convolution, and the third is another 1×1 convolution but without any non-linearity (Table 2). The authors argue that if ReLU were used again at this point, deep networks would only have the power of a linear classifier on the non-zero volume part of the output domain.
Table 2. MobileNet V2 layers (source: original paper [29])
| Input | Operator | Output |
|---|---|---|
| h × w × k | 1×1 conv2d, ReLU6 | h × w × (tk) |
| h × w × tk | 3×3 dwise s=s, ReLU6 | h/s × w/s × (tk) |
| h/s × w/s × tk | linear 1×1 conv2d | h/s × w/s × k’ |
There is also an expansion factor $t$. The authors used $t = 6$ for all main experiments [29; 30]. If the input has 64 channels, the internal output has $64 \cdot t = 64 \cdot 6 = 384$ channels. Table 3 shows the overall MobileNetV2 architecture, where $t$ is the expansion factor, $c$ is the number of output channels, $n$ is the number of repetitions, and $s$ is the stride; 3×3 kernels are used for the spatial convolution. The authors also note that removing ReLU6 at the output of each bottleneck module improves accuracy (Fig. 12, a), and that a shortcut between bottlenecks outperforms both a shortcut between expansions and a network without any residual connections (Fig. 12, b) [29; 30].
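The three-layer V2 block can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: batch normalization is omitted, the weights are random, tensors use a height × width × channels layout, and the shortcut is added only when the stride is 1 and the input and output shapes match. All function names are hypothetical.

```python
import numpy as np


def relu6(x):
    return np.clip(x, 0.0, 6.0)


def pointwise(x, w):
    # x: (H, W, C_in), w: (C_in, C_out); a 1x1 convolution is a per-pixel matrix product
    return x @ w


def depthwise3x3(x, w, stride=1):
    # x: (H, W, C), w: (3, 3, C); zero padding of 1 pixel on each side
    h, wd, c = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out_h, out_w = (h + stride - 1) // stride, (wd + stride - 1) // stride
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + 3, j * stride:j * stride + 3, :]
            out[i, j] = (patch * w).sum(axis=(0, 1))   # each channel filtered separately
    return out


def inverted_residual(x, w_expand, w_dw, w_project, stride=1):
    y = relu6(pointwise(x, w_expand))           # 1x1 expansion: k -> t*k channels
    y = relu6(depthwise3x3(y, w_dw, stride))    # 3x3 depthwise filtering
    y = pointwise(y, w_project)                 # linear 1x1 projection, no ReLU6
    if stride == 1 and x.shape == y.shape:
        y = y + x                               # shortcut between bottlenecks
    return y


rng = np.random.default_rng(0)
k, t = 64, 6
x = rng.normal(size=(8, 8, k))
w_expand = rng.normal(size=(k, t * k)) * 0.1
w_dw = rng.normal(size=(3, 3, t * k)) * 0.1
w_project = rng.normal(size=(t * k, k)) * 0.1

out_s1 = inverted_residual(x, w_expand, w_dw, w_project, stride=1)
out_s2 = inverted_residual(x, w_expand, w_dw, w_project, stride=2)
print(out_s1.shape, out_s2.shape, "expanded channels:", w_expand.shape[1])
```

With 64 input channels and $t = 6$ the intermediate tensor has 384 channels, and the stride-2 variant halves the spatial size without a shortcut.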
Table 3. MobileNetV2 Overall Architecture (source: original paper [29])
| Input | Operator | t | c | n | s |
|---|---|---|---|---|---|
| 224² × 3 | conv2d | - | 32 | 1 | 2 |
| 112² × 32 | bottleneck | 1 | 16 | 1 | 1 |
| 112² × 16 | bottleneck | 6 | 24 | 2 | 2 |
| 56² × 24 | bottleneck | 6 | 32 | 3 | 2 |
| 28² × 32 | bottleneck | 6 | 64 | 4 | 2 |
| 14² × 64 | bottleneck | 6 | 96 | 3 | 1 |
| 14² × 96 | bottleneck | 6 | 160 | 3 | 2 |
| 7² × 160 | bottleneck | 6 | 320 | 1 | 1 |
| 7² × 320 | conv2d 1×1 | - | 1280 | 1 | 1 |
| 7² × 1280 | avgpool 7×7 | - | - | 1 | - |
| 1 × 1 × 1280 | conv2d 1×1 | - | k | - | - |
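The input sizes in Table 3 can be traced from the stride and channel columns. The sketch below assumes, following the original paper [29], that only the first of the $n$ repeated layers in a row uses stride $s$ and the remaining ones use stride 1; the variable and function names are hypothetical.

```python
# (t, c, n, s) for the first conv2d and the seven bottleneck rows of Table 3.
stages = [
    (None, 32, 1, 2),
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def trace_shapes(resolution=224):
    """Return (resolution, channels) at the output of every row."""
    shapes = []
    for _t, c, _n, s in stages:
        resolution //= s          # the stride is applied once per row
        shapes.append((resolution, c))
    return shapes


shapes = trace_shapes()
bottleneck_layers = sum(n for t, _c, n, _s in stages if t is not None)
for (res, ch) in shapes:
    print(f"{res} x {res} x {ch}")
print("bottleneck layers:", bottleneck_layers)
```

Each printed shape matches the input of the following row in Table 3, ending with 7 × 7 × 320 before the final 1×1 convolution.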
Fig. 12. (a) Impact of Linear Bottleneck; (b) impact of Shortcut; horizontal axes: step, millions (source: original paper [29])
The Hourglass architecture. The so-called Hourglass Network is an architecture that combines a contracting path that extracts information and an expanding path that maps features to locations [31; 32]. The name comes from the shape formed by the union of the two paths, in which information is narrowed before being expanded again. This architecture originates from Fully Convolutional Networks (FCNs) [34]. Presented in 2015, FCNs modify the typical structure of Convolutional Neural Networks (CNNs) to obtain segmentation maps as output. A CNN is a deep network that uses 2 operations: convolution and pooling. A convolution is a mathematical operation that, through the use of filters, extracts the presence of different (learned) features in an input. The pooling operation reduces the size of the convolved matrix, condensing the extracted information (Fig. 13).
The basic structure of a CNN is shown in Fig. 14. Here convolution and pooling operations are applied in sequence, followed by fully-connected layers that classify the received input [35]. FCNs replace the final fully-connected layers with upsampled versions of the pooling layer outputs, and thus retain spatial information and map it back onto the original input [34]. The FCN also combines the upsampled output of the last pooling layer with upsampled outputs of other pooling layers in the network. Thanks to this, the outputs of deeper layers, which contain more information about features, are combined with the outputs of early layers, which still carry information about location [34]. The output of the network’s final layer is thus a segmentation map built on information from several previous layers instead of just the final one. However, this approach is only the basis for hourglass networks; the idea of the hourglass architecture is developed further in the following models.
Fig. 13. Convolution and Pooling operations (source: medium.com/@calleris.enrico [32])
Fig. 14. Basic structure of a CNN: Convolutional Operations with Pooling Operations and Fully-Connected Layers (source: medium.com/@calleris.enrico [32])
One further development of FCNs is the U-Net architecture [36]. It is usually considered the earliest example of an hourglass network and was initially developed for biomedical image segmentation. U-Net builds on the FCN concept to achieve impressive performance. The main difference between U-Net and FCNs is the structure of the upsampling operations: upsampling is no longer a single step but a path that matches the length of the downsampling path. The two paths are symmetrical, and the resulting network shape gives the model its name (Fig. 15, a). After a downward path with the classic CNN structure of sequential convolution-pooling operations, the upward path upsamples the received input and concatenates it with the output of the corresponding layer of the downward path (similarly to FCN). In each upward layer, the previous output is upsampled and de-convolved, and then combined with the output of the corresponding layer in the downward path (as shown by the horizontal arrows in Fig. 15, a). Thanks to this operation, the feature information extracted by the lower (deeper) layers is combined with the higher-spatial-resolution outputs of the early convolutional layers. The result is a complete localized feature map, which yields an accurate segmentation map as the final output.
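The combination of the two paths can be illustrated with a one-level sketch in NumPy. It is a simplification under stated assumptions: convolutions are left out, 2×2 max pooling stands in for the downward step, and nearest-neighbour upsampling stands in for the learned de-convolution. The function names are hypothetical.

```python
import numpy as np


def downsample(x):
    # 2x2 max pooling; x has shape (H, W, C) with even H and W
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


def upsample(x):
    # nearest-neighbour 2x upsampling (stand-in for a learned de-convolution)
    return x.repeat(2, axis=0).repeat(2, axis=1)


def tiny_unet_level(x):
    skip = x                     # early, high-resolution features
    bottom = downsample(x)       # contracting path
    up = upsample(bottom)        # expanding path
    return np.concatenate([skip, up], axis=-1)   # skip connection by concatenation


x = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
y = tiny_unet_level(x)
print(x.shape, "->", y.shape)
```

The output keeps the input resolution and doubles the channel count: the first half carries the location-rich early features, the second half the condensed features from the lower level.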
Another variant of an hourglass network is the V-Net. This model was introduced in 2016 for segmentation of 3D medical images [38]. Its approach is similar to that of U-Net, as it is also based on two symmetric contracting and expanding paths. The main difference is that, unlike U-Net, it uses a fully-convolutional structure in which only convolution operations are present and pooling layers are absent. The benefit is twofold. First, the pooling operation can easily be replaced by a convolution with a larger stride, and thus the network can be trained faster [39]. Second, it is easier to apply the corresponding upsampling operation in the expanding path, since de-convolutions are preferred to un-pooling in order to simplify understanding and analysis [32; 38]. It is also worth mentioning a difference in training between U-Net and V-Net: U-Net uses classic stochastic gradient descent, while V-Net additionally utilizes residual connections [15] to make convergence faster and improve the segmentation results.
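That a strided convolution can take over the role of pooling is easy to see in a small example. The sketch below is 2D for brevity (V-Net itself works on 3D volumes) and uses fixed weights of 0.25, which reproduce average pooling exactly; in a trained network these weights would be learned. The function names are hypothetical.

```python
import numpy as np


def avg_pool2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def strided_conv2x2(x, kernel, stride=2):
    # 2x2 convolution without padding, moved in steps of `stride`
    h, w = x.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(h // stride):
        for j in range(w // stride):
            patch = x[i * stride:i * stride + 2, j * stride:j * stride + 2]
            out[i, j] = (patch * kernel).sum()
    return out


x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((2, 2), 0.25)      # fixed weights chosen for the illustration
pooled = avg_pool2x2(x)
convolved = strided_conv2x2(x, kernel)
print(np.allclose(pooled, convolved))
```

With learnable weights the same layer can represent average pooling as a special case, while also being free to learn a different downsampling filter.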
Despite these differences, U-Net and V-Net share a similar approach based on two interconnected paths: the downward (contracting) path for progressive feature extraction from the scene and the upward (expanding) path for mapping the extracted features to specific locations in the original image [37]. This overall concept explains the name: the two paths contract and expand, meeting midway to form the hourglass shape (Fig. 15, b). The output of this type of architecture is an accurate segmentation map, and it remains one of the most used approaches in computer vision [40].
Fig. 15. (a) U-Net architecture with a downward path extracting features (left) and an upward path mapping them into locations (right); (b) conceptualization of the Hourglass Network, with the contracting path (up) and the expanding path (down) (source: medium.com/@calleris.enrico [32])
EfficientDet architecture. EfficientDet is a family of object detection models. As model efficiency is very important in computer vision, a lot of research in recent years has been directed towards more accurate object detection [41]. However, a more accurate object detection network is usually more expensive in terms of the number of parameters and the amount of computation (FLOPs). The simplest and most straightforward way to increase the accuracy of an object detection network is to make the network deeper by increasing the number of layers, to increase the number of channels, or to increase the input image resolution. However, arbitrarily increasing any single one of these dimensions yields diminishing accuracy gains. In the EfficientDet paper [42] the authors therefore introduced a systematic way of model scaling and showed that carefully balancing network depth, width and resolution can lead to better performance.
The initial concept of model scaling, i.e. increasing the network depth, width and resolution to enhance its performance, was presented in the EfficientNet paper [43] for image classification. Following the EfficientNet results, the authors applied this technique to object detection and called the result EfficientDet [41; 42]. This architecture follows the one-stage detector paradigm and uses ImageNet-pretrained EfficientNets as the backbone network. The Bidirectional Feature Pyramid Network (BiFPN) was introduced to serve as the feature network: it takes the level 3–7 features (P3, P4, P5, P6, P7) from the backbone network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are then fed to a class prediction network and a box prediction network, which generate object class and bounding box predictions respectively (Fig. 16).
The development of BiFPN is shown in Fig. 17 [42; 45]. Here, FPN [22] is the baseline way to fuse features with a top-down flow; the Path Aggregation Network (PANet) [45] lets the feature fusion go both backwards and forwards, from smaller to larger resolution; NAS-FPN [46] is another, earlier feature fusion technique. The EfficientDet architecture uses an edited structure of NAS-FPN to create the BiFPN blocks and stacks them on top of each other, with the number of blocks varying in the model scaling procedure. Also, since different features and feature channels may contribute unequally to the final prediction, a set of learnable weights was added to the inputs of the feature fusion. Before EfficientDet, model scaling for object detection generally scaled parts of the network independently; for example, with ResNet only the size of the backbone network is scaled. EfficientDet instead scales all parts jointly, similarly to the joint scaling approach used to create EfficientNet. The scaling task was set to vary the size of the backbone network, the BiFPN network, the class/box network, and the input resolution. The backbone network scales up directly with the pretrained checkpoints of EfficientNet-B0 through EfficientNet-B6, and its width and depth are varied along with the number of BiFPN stacks [44].
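One way to realize such learnable fusion weights is the fast normalized fusion described in the EfficientDet paper [42], where each input is multiplied by a non-negative weight and the result is divided by the sum of the weights. The sketch below assumes this variant; the text above only states that learnable weights are used. Features are plain lists and all names are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted sum of same-length feature lists with non-negative, normalized weights."""
    w = [max(0.0, wi) for wi in weights]       # ReLU keeps every weight non-negative
    total = eps + sum(w)                       # eps avoids division by zero
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total for k in range(length)]


p_top_down = [1.0, 2.0, 3.0]    # illustrative top-down feature
p_input = [3.0, 2.0, 1.0]       # illustrative same-level input feature
fused = fast_normalized_fusion([p_top_down, p_input], weights=[2.0, 1.0])
print(fused)
```

In training, the entries of `weights` would be parameters updated by gradient descent, so the network itself learns how much each input contributes.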
Fig. 16. EfficientDet architecture: it uses EfficientNet as the backbone network, BiFPN as
the feature network and shared class/box prediction network. (source: original paper [42])
The specifications of EfficientDet are the following. The backbone network uses the same width/depth scaling coefficients as EfficientNet-B0 to B6, so that ImageNet-pretrained checkpoints can be used. The BiFPN width (number of channels) grows exponentially, as in EfficientNets, while its depth (number of layers) is increased linearly, since it needs to be rounded to small integers. After a grid search, 1.35 was found to be the best scale factor for the width:
$$W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi,$$
where $\phi$ is the compound scaling coefficient listed in Table 4. The box/class prediction network has the same width as the BiFPN, but its depth is increased linearly:
$$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor.$$
As for the input image resolution: since feature levels 3–7 are used in BiFPN, the input resolution must be divisible by $2^7 = 128$, so resolutions are increased linearly:
$$R_{input} = 512 + \phi \cdot 128.$$
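Table 4. EfficientDet scaling (source: original paper [42])
| Model ($\phi$) | Input size $R_{input}$ | Backbone Network | BiFPN #channels $W_{bifpn}$ | BiFPN #layers $D_{bifpn}$ | Box/class #layers $D_{class}$ |
|---|---|---|---|---|---|
| D0 ($\phi = 0$) | 512 | B0 | 64 | 3 | 3 |
| D1 ($\phi = 1$) | 640 | B1 | 88 | 4 | 3 |
| D2 ($\phi = 2$) | 768 | B2 | 112 | 5 | 3 |
| D3 ($\phi = 3$) | 896 | B3 | 160 | 6 | 4 |
| D4 ($\phi = 4$) | 1024 | B4 | 224 | 7 | 4 |
| D5 ($\phi = 5$) | 1280 | B5 | 288 | 7 | 4 |
| D6 ($\phi = 6$) | 1280 | B6 | 384 | 8 | 5 |
| D7 ($\phi = 7$) | 1536 | B6 | 384 | 8 | 5 |
| D7x | 1536 | B7 | 384 | 8 | 5 |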
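The scaling rules can be evaluated for a given $\phi$ as in the sketch below. It returns the raw formula values; the function name is hypothetical. Note that Table 4 lists rounded channel counts, and that for the largest models (D5 and above) the tabulated input size and BiFPN depth no longer follow the linear rules exactly.

```python
def efficientdet_scaling(phi):
    """Raw values of the compound scaling rules for coefficient phi."""
    return {
        "w_bifpn": 64 * 1.35 ** phi,    # BiFPN channels before rounding
        "d_bifpn": 3 + phi,             # BiFPN layers
        "d_class": 3 + phi // 3,        # box/class network layers
        "r_input": 512 + phi * 128,     # input resolution
    }


for phi in range(5):
    cfg = efficientdet_scaling(phi)
    print(f"D{phi}: width {cfg['w_bifpn']:.1f}, BiFPN layers {cfg['d_bifpn']}, "
          f"head layers {cfg['d_class']}, input {cfg['r_input']}")
```

For $\phi = 0 \ldots 4$ the depth and resolution values coincide with the D0 to D4 rows of Table 4, and the raw widths (64, 86.4, 116.6, 157.5, 212.6) lie close to the tabulated 64, 88, 112, 160 and 224.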
Fig. 17. Feature network design: (a) FPN [22] introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3 - P7); (b) PANet [45] adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN [46] uses neural architecture search to find an irregular feature network topology and then repeatedly applies the same block; (d) BiFPN, with better accuracy and efficiency trade-offs (source: original paper [42])
Single Shot MultiBox Detector (SSD). The SSD [27; 47] is a detector designed for real-time object detection. Faster R-CNN [26] utilizes an RPN to create boundary boxes and uses those boxes for object classification; despite its accuracy, it is not very fast (up to 7 fps) and thus is not suitable for real-time object detection. SSD is faster because it eliminates the need for an RPN, and it maintains good accuracy by using multi-scale features and default boxes. This allows SSD to gain speed by using lower-resolution images while keeping its accuracy at the level of Faster R-CNN. This was confirmed by our experiment, as shown in the model performance comparison at the end of this paper.
The Single Shot MultiBox Detector consists of 2 parts: extraction of feature maps and application of convolution filters for object detection. The feature maps are extracted with the VGG16 architecture [48; 49], and objects are detected using the Conv4_3 layer (Fig. 18, a). For illustration, in Fig. 18, b the Conv4_3 layer is drawn as an 8×8 grid (its actual size is 38×38); for each cell (i.e. location) it makes 4 object predictions. Each prediction consists of a boundary box and 21 scores, one per class (including 1 extra class for no object), and the class with the highest score is taken as the class of the bounded object. In total, Conv4_3 makes 38×38×4 predictions: four predictions per cell regardless of the depth of the feature maps. Since many of these predictions contain no object, SSD reserves class “0” to mark that a box contains no object (Fig. 19, a). The process of making multiple predictions containing boundary boxes and confidence scores is called multibox [47].
As for predictors for object detection, SSD does not use a special RPN; instead, it calculates both the location and the class scores with small convolution filters. After the feature maps are extracted, the detector applies 3×3 convolution filters at each cell to make predictions; these filters compute their results in the same way as regular CNN filters. The output of each filter consists of 25 channels: 21 class scores plus the 4 values of one boundary box. For example, each of the four 3×3 filters applied in Conv4_3 maps the 512 input channels to 25 output channels (Fig. 19, b):
$$(38 \times 38 \times 512) \xrightarrow{\;(4 \times 3 \times 3 \times 512 \times (21 + 4))\;} (38 \times 38 \times 4 \times (21 + 4)).$$
For detection, SSD uses multiple layers (multi-scale feature maps) to detect objects independently. As the CNN gradually reduces the spatial dimension, the resolution of the feature maps decreases. SSD uses the lower-resolution layers to detect larger-scale objects: for example, the 4×4 feature maps are used for larger-scale objects (Fig. 20, a). SSD also adds 6 auxiliary convolution layers after VGG16, 5 of which are used for object detection. In three of those layers 6 predictions are made per cell instead of 4, and in total SSD makes 8732 predictions using 6 layers (Fig. 20, b). The use of multi-scale feature maps improves accuracy significantly.
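The total of 8732 predictions can be reproduced by summing the predictions of the six detection layers. In the sketch below the feature map sizes other than 38×38, and the assignment of 4 or 6 boxes to each layer, are assumptions taken from the usual SSD300 configuration rather than from the text above.

```python
# (feature map size, default boxes per cell) for the six detection layers.
# Sizes 19, 10, 5, 3, 1 and the 4/6 assignment are assumed from the SSD300 setup.
detection_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

per_layer = [size * size * boxes for size, boxes in detection_layers]
total_predictions = sum(per_layer)

channels_per_prediction = 21 + 4          # 21 class scores + 4 box values
conv4_3_output_channels = 4 * channels_per_prediction

print(per_layer, total_predictions, conv4_3_output_channels)
```

Conv4_3 alone contributes 38 × 38 × 4 = 5776 of the predictions, which is why the smallest objects depend almost entirely on this layer.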
Fig. 18. (a) General SSD architecture idea; (b) detector work: left: the original image, right: 4 predictions at each cell (source: article [47])
SSD also uses default boundary boxes, which are equivalent to anchors in Faster R-CNN [26]. In principle, one could start with random boundary box predictions and optimize the model with gradient descent. During early training, however, it can be hard to determine which shapes (e.g. cars or pedestrians) should be optimized for which predictions, and research has shown that early training can be very unstable. The boundary box predictions in Fig. 21, a work well for one category but not for others; the initial predictions need to be diverse rather than similar to each other. To detect several kinds of objects, the predictions have to cover more shapes. Such an approach makes training easier and more stable (Fig. 21, b).
In real life, boundary boxes do not have arbitrary sizes and shapes, so the ground truth boundary boxes can be partitioned into clusters, with each cluster represented by a default boundary box, i.e. by the centroid of the cluster. Thus, instead of making random predictions, one can start the guesses from those default boxes. SSD keeps the number of default boxes to a minimum (4 or 6), with one prediction per default box. The boundary box predictions do not use global coordinates; instead, they are relative to the default boundary box at each cell: (∆cx, ∆cy, ∆w, ∆h) are the offsets (differences) to the default box for its center (cx, cy), width and height. Within one feature map layer every cell uses the same set of default boxes, centered at that cell, but different layers use different sets of default boxes in order to detect objects at different resolutions (Fig. 21, c).
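Turning predicted offsets back into a box can be sketched as follows. The sketch assumes the offset encoding used in the SSD localization loss (center offsets scaled by the default box size, logarithmic width and height offsets); practical implementations often add extra scaling constants, which are left out here. The function name and the numbers are illustrative.

```python
import math


def decode_box(default_box, offsets):
    """default_box: (cx, cy, w, h); offsets: (dcx, dcy, dw, dh) predicted relative to it."""
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w          # center shift measured in default box widths
    cy = d_cy + t_cy * d_h          # center shift measured in default box heights
    w = d_w * math.exp(t_w)         # log-scale change of width
    h = d_h * math.exp(t_h)         # log-scale change of height
    return cx, cy, w, h


default_box = (0.5, 0.5, 0.2, 0.4)
print(decode_box(default_box, (0.0, 0.0, 0.0, 0.0)))      # zero offsets keep the default box
print(decode_box(default_box, (0.1, -0.05, 0.0, 0.0)))    # small shift of the center
```

Because the offsets are relative, the same predictor weights can be reused at every cell of a feature map.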
Fig. 19. (a) Multibox predictor; (b) convolutional predictors for object detection (source: article [47])
Fig. 20. (a) Lower resolution feature maps (right) detect larger-scale objects [47]; (b) SSD full architecture [27; 47]
SSD’s default boundary boxes are chosen manually, and the network defines a scale value for each feature map layer. As shown in Fig. 20, b, starting from the left, Conv4_3 detects objects at the smallest scale of 0.2 (or sometimes even 0.1). The scale then increases linearly up to 0.9 at the rightmost layer. The width and height of the default boxes are then calculated by combining the scale value with the target aspect ratios. For layers that make 6 predictions, SSD starts with 5 target aspect ratios: 1, 2, 3, 1/2, and 1/3. The width and the height of the default boxes are calculated as
$$w = \text{scale} \cdot \sqrt{\text{aspect ratio}}, \qquad h = \frac{\text{scale}}{\sqrt{\text{aspect ratio}}}.$$
SSD then adds an additional default box with aspect ratio = 1 and scale
$$\text{scale}' = \sqrt{\text{scale} \cdot \text{scale at next level}}.$$
By contrast, YOLO [52] uses k-means clustering on the training dataset to determine its default boundary boxes.
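The default box shapes for one layer can be computed as in the sketch below. It assumes six detection layers with scales spaced linearly between 0.2 and 0.9; the function names are hypothetical.

```python
import math


def layer_scales(num_layers=6, s_min=0.2, s_max=0.9):
    """Scales increasing linearly from s_min at the first layer to s_max at the last."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


def default_box_shapes(scale, next_scale, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) of the default boxes for one layer, relative to the image size."""
    shapes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]
    extra = math.sqrt(scale * next_scale)     # additional box with aspect ratio 1
    shapes.append((extra, extra))
    return shapes


scales = layer_scales()
boxes = default_box_shapes(scales[1], scales[2])
for w, h in boxes:
    print(f"w = {w:.3f}, h = {h:.3f}")
```

All boxes built from the same scale have the same area, so the aspect ratio changes only the shape, while the extra box fills the gap in size towards the next layer.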
Fig. 21. (a) With predictions not being diverse, the model will not perform well; (b) diverse predictions cover more object types; (c) 4 default boundary boxes [47]
Fig. 22. (a) The ground truth object (blue) and 3 default boundary boxes (green); (b) default box 2 has IOU > 0.5 with the ground truth [47]
With SSD’s matching strategy, predictions are classified as positive matches or negative matches. SSD uses only positive matches to calculate the localization cost (the mismatch of the boundary box). If the corresponding default boundary box (not the predicted boundary box) has an IOU > 0.5 with the ground truth, the match is considered positive; otherwise, it is negative. IOU (Intersection over Union) is a metric that measures the degree of overlap between two bounding boxes as the ratio of the area of their intersection to the area of their union: $IOU = S_{intersection} / S_{union}$. For example, suppose there are 3 default boxes and only default boxes 1 and 2 (not 3) have an IOU > 0.5 with the ground truth box (blue box, Fig. 22, a). Then only boxes 1 and 2 are positive matches. Once the positive matches are identified, the cost is calculated using the corresponding predicted boundary boxes (Fig. 22, b). This matching strategy encourages each prediction to predict shapes closer to the corresponding default box, which makes the predictions more diverse and the training more stable [47].
Fig. 23 shows how SSD combines multi-scale feature maps and default boundary boxes to detect objects at different scales and aspect ratios. The dog matches one default box (in red) in the 4×4 feature map layer but none of the default boxes in the higher-resolution 8×8 feature map, while the cat, being smaller, is detected only by the 8×8 feature map layer, in 2 default boxes (in blue). Higher-resolution feature maps are responsible for detecting small objects. The first layer used for object detection, Conv4_3, has a spatial dimension of 38×38, which is a large reduction from the input image. That is why SSD usually performs worse on small objects than other detection methods, but this problem can be reduced by using images with higher resolution.
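The IOU computation and the positive/negative split can be sketched as follows. Boxes are given as corner coordinates (x_min, y_min, x_max, y_max); the example boxes and function names are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def positive_matches(default_boxes, ground_truth, threshold=0.5):
    """Indices of default boxes whose IOU with the ground truth exceeds the threshold."""
    return [i for i, box in enumerate(default_boxes) if iou(box, ground_truth) > threshold]


ground_truth = (2.0, 2.0, 6.0, 6.0)
default_boxes = [
    (2.0, 2.0, 6.0, 5.0),   # IOU 0.75
    (1.0, 2.0, 6.0, 6.0),   # IOU 0.80
    (5.0, 5.0, 9.0, 9.0),   # barely overlaps
]
matches = positive_matches(default_boxes, ground_truth)
print(matches)
```

As in the three-box example above, the first two default boxes become positive matches and the third is treated as a negative.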
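SSD’s localization loss is the mismatch between the ground truth box and the predicted boundary box. SSD only penalizes predictions from positive matches: these predictions should get closer to the ground truth, while negative matches can be ignored. The loss is defined as the smooth L1 loss between the predicted box $l$ and the ground truth box $g$, with the regression targets being the offsets for the center $(cx, cy)$ of the default bounding box $d$ and for its width $w$ and height $h$ [27]:
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right);$$
$$\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^{w}, \qquad \hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^{h};$$
$$\hat{g}_j^{w} = \log\left(\frac{g_j^{w}}{d_i^{w}}\right), \qquad \hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{d_i^{h}}\right);$$
$$x_{ij}^{p} = \begin{cases} 1, & \text{if } IOU > 0.5 \text{ between default box } i \text{ and ground truth box } j \text{ on class } p, \\ 0, & \text{otherwise.} \end{cases}$$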
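The localization loss for a single positive match can be sketched as follows. The smooth L1 function is assumed in its common form (quadratic below 1, linear above), which the text does not spell out; boxes are given as (cx, cy, w, h), and the function names are hypothetical.

```python
import math


def smooth_l1(x):
    # Common definition: quadratic near zero, linear for large errors.
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def encode_box(ground_truth, default_box):
    """Regression targets (g_hat) of a ground truth box relative to a default box."""
    g_cx, g_cy, g_w, g_h = ground_truth
    d_cx, d_cy, d_w, d_h = default_box
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def localization_loss(predicted_offsets, ground_truth, default_box):
    """Smooth L1 loss summed over cx, cy, w, h for one positive match."""
    targets = encode_box(ground_truth, default_box)
    return sum(smooth_l1(l - g) for l, g in zip(predicted_offsets, targets))


default_box = (0.5, 0.5, 0.2, 0.2)
ground_truth = (0.6, 0.5, 0.2, 0.4)
targets = encode_box(ground_truth, default_box)
print(targets)
print(localization_loss((0.0, 0.0, 0.0, 0.0), ground_truth, default_box))
```

A prediction equal to the encoded targets gives zero loss; the full loss sums this quantity over all positive matches.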
Fig. 23. (a) SSD detector work example [27]; (b) size reduction [47]
SSD also uses a confidence loss, i.e. the loss of making a class prediction. For every positive match prediction, the loss is penalized according to the confidence score of the corresponding class. For negative match predictions, the loss is penalized according to the confidence score of class “0”, which means that no object is detected. The loss is the softmax loss over the multiple class confidences $c$ (class scores), and $N$ is the number of matched default boxes:
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \quad \text{where } \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}.$$
The final loss function is calculated as:
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right),$$
where $N$ is the number of positive matches and $\alpha$ is the weight for the localization loss.
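The confidence loss and the combined loss can be sketched as follows. The default $\alpha = 1$ and the convention of returning zero when there are no positive matches are assumptions taken from the SSD paper [27]; the data layout and function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)                          # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(positives, negatives):
    """positives: list of (class_scores, matched_class); negatives: list of class_scores.

    Index 0 of every score list is the "no object" class.
    """
    loss = -sum(math.log(softmax(scores)[cls]) for scores, cls in positives)
    loss -= sum(math.log(softmax(scores)[0]) for scores in negatives)
    return loss


def total_loss(l_conf, l_loc, num_positives, alpha=1.0):
    if num_positives == 0:
        return 0.0                              # no positive matches: loss set to zero
    return (l_conf + alpha * l_loc) / num_positives


l_conf = confidence_loss(positives=[([0.0, 0.0, 0.0], 1)], negatives=[[0.0, 0.0, 0.0]])
print(l_conf, total_loss(l_conf, l_loc=0.5, num_positives=1))
```

With uniform scores over three classes each term equals $\log 3$, which makes the example easy to verify by hand.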
To remove duplicate predictions pointing to the same object, SSD uses non-maximum suppression. It sorts the predictions by confidence score and, starting from the most confident prediction, checks whether any previously kept boundary box of the same class has an IOU > 0.45 with the current prediction; if so, the current prediction is ignored. Even so, far more predictions are made than there are objects present, so there are many more negative matches than positive matches. This creates a class imbalance problem that worsens training, as the model mostly learns the background rather than how to detect objects. SSD still needs negative samples in order to recognize what represents a bad prediction, so instead of using all negatives it sorts them by their calculated confidence loss, takes the negatives with the top loss, and makes sure the ratio between the picked negatives and positives is at most 3:1, which makes the training process faster and more stable [47].
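Both steps can be sketched in a few lines. The sketch handles a single class, uses corner-format boxes (x_min, y_min, x_max, y_max), and works on made-up example data; the function names are hypothetical.

```python
def iou(a, b):
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.45):
    """Indices of kept boxes, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # keep a box only if it does not overlap too much with an already kept one
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


def hard_negative_mining(negative_losses, num_positives, ratio=3):
    """Indices of the negatives with the highest loss, at most ratio * num_positives."""
    order = sorted(range(len(negative_losses)), key=lambda i: negative_losses[i], reverse=True)
    return order[:ratio * num_positives]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
hard_negatives = hard_negative_mining([0.1, 2.0, 0.5, 1.5, 0.05, 0.7, 0.9], num_positives=1)
print(kept, hard_negatives)
```

The second box overlaps the first with an IOU of about 0.68 and is dropped, while the three hardest negatives are retained for one positive match.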
Faster R-CNN architecture. Faster R-CNN is an object detection architecture presented in 2015 [26; 50; 51]. It is one of the best-known object detection architectures based on convolutional neural networks, along with SSD (Single Shot Detector) and YOLO (You Only Look Once) [52]. The idea behind Faster R-CNN was to create a unified architecture that not only detects objects within an image but also locates them precisely. Faster R-CNN combines the benefits of deep learning, CNNs and RPNs in a single network that significantly improves the speed and accuracy of the model [50]. It consists of two key components, the Region Proposal Network (RPN) and the Fast R-CNN detector, and its backbone consists of shared convolutional layers, i.e. common CNN layers used by both the RPN and the Fast R-CNN detector (Fig. 24, a).
The backbone of Faster R-CNN is a CNN that extracts relevant features from the input image. It consists of multiple convolutional layers that apply different convolution kernels to extract those features. The kernels capture hierarchical representations of the input image: the initial layers capture low-level features (basic textures or edges), while deeper layers capture high-level semantic features such as object parts and shapes. Both the RPN and the Fast R-CNN detector use the same extracted hierarchical features. This significantly reduces computing time and memory use, as the computations performed by these layers serve both tasks.
Previously, the R-CNN and Fast R-CNN architectures used the Selective Search algorithm [53] to generate region proposals. This process is executed on the CPU and therefore takes more computation time. With the introduction of Faster R-CNN [26] this problem was solved by using a convolutional network, i.e. the RPN. This step reduced the proposal time for each image from 2 seconds to 10 ms and improved feature representation by sharing layers with the detection stages. The Region Proposal Network is a key component of the Faster R-CNN architecture, as it is responsible for generating possible ROIs (regions of interest, or region proposals) in images that may contain objects. It acts as an attention mechanism that tells the subsequent Fast R-CNN detector where in the image the objects should be searched for. The main components of the Region Proposal Network are as follows [50]:
1. Anchor boxes: anchors are used for region proposal generation in the Faster R-CNN model. The model uses a set of predefined anchor boxes with different scales and aspect ratios, placed at different positions on the feature maps. The two key parameters of an anchor box are its scale and its aspect ratio.
2. Sliding Window approach: the RPN runs as a sliding window over the feature map received from the CNN backbone. It uses a small convolutional network, usually a 3×3 convolutional layer, to process the features within the receptive field of the sliding window. This convolutional operation produces scores that represent the object presence probability and regression values for adjusting the anchor boxes.
3. Objectness Score: this value represents the probability that a given anchor box contains an object of interest rather than just background. The RPN predicts this score for each anchor, and it reflects the confidence that the anchor corresponds to a meaningful object region. During training, the score is also used to classify anchors as either positive (object) or negative (background).
4. IOU (Intersection over Union) metric: the ratio of the overlap area of two boxes to the area of their union, used to compare boxes with each other and with the ground truth.
5. Non-Maximum Suppression (NMS): this operation removes redundant proposals and selects the most accurate ones. Among overlapping proposals it keeps only the one with the highest objectness score and eliminates the others.
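The IOU metric and the NMS step listed above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the experiment: the `[x1, y1, x2, y2]` box layout, the function names and the default threshold of 0.7 are assumptions made for illustration.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the intersection are clipped at zero for disjoint boxes.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.7, top_n=None):
    """Greedy NMS: returns the indices of the kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if top_n is not None and len(keep) == top_n:
            break
        rest = order[1:]
        # Drop every remaining box that overlaps the kept box too strongly.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```

With a lower threshold more overlapping proposals are suppressed; the `top_n` argument mirrors the idea of keeping only the top-N proposals.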
The RPN takes the feature maps produced by the CNN backbone and applies a sliding window approach to them (as shown in Fig. 9, c), with anchor boxes of varying scales and shapes marking potential object positions. During training, these anchor boxes are refined to better match the actual object positions and sizes. For each anchor, the RPN predicts two things: an objectness score, i.e. the probability that the anchor contains an object, and adjustments of the anchor coordinates to match the actual object's shape. As a large number of region proposals are generated, and many of them overlap and correspond to the same object, Non-Maximum Suppression is used to rank the anchor boxes according to their objectness probabilities and to keep only the top-N anchor boxes with the highest scores. This keeps the final selection both accurate and largely non-overlapping, and the algorithm treats the selected anchor boxes as the region proposals.
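How the predicted adjustments turn an anchor into a proposal can be sketched as follows. The sketch assumes the box parameterization from the Faster R-CNN paper [26]: center offsets are scaled by the anchor size, and width and height are predicted as log-scale factors. The function name and the array layout are illustrative.

```python
import numpy as np


def decode_boxes(anchors, deltas):
    """Apply predicted adjustments (tx, ty, tw, th) to anchors [x1, y1, x2, y2]."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa
    ya = anchors[:, 1] + 0.5 * ha
    # The center shift is relative to the anchor size; the size change is exponential.
    x = xa + deltas[:, 0] * wa
    y = ya + deltas[:, 1] * ha
    w = wa * np.exp(deltas[:, 2])
    h = ha * np.exp(deltas[:, 3])
    return np.stack([x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h], axis=1)
```

All-zero adjustments return the anchor unchanged, which is why a well-placed anchor needs only small corrections.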
The next key component of the Faster R-CNN network is the Fast R-CNN detector itself, which is responsible for object detection within the region proposals suggested by the RPN. It operates in several stages [50]:
1. Region of Interest (RoI) Pooling: in this first step, RoI pooling is applied to the region proposals suggested by the RPN. This operation transforms the RPN’s variable-sized region proposals into fixed-size feature maps that will be handed
into the network’s subsequent layers. RoI pooling divides each region proposal
into a grid of cells of equal size and then applies max pooling within each cell,
thus, generating a fixed-size feature map for each region proposal (Fig. 24, b).
2. Feature Extraction: in this stage, the feature maps obtained after RoI pooling are fed into the CNN backbone (the same one used in the RPN for feature extraction) in order to extract hierarchical features that capture object-specific information. Hierarchical features keep the spatial information while separating low-level details, and thus allow the network to understand the content of the proposed regions.
3. Fully Connected Layers: the algorithm passes the RoI-pooled and feature-extracted regions through a series of fully connected layers, which are used for object classification and bounding box regression.
a. Object Classification: the network predicts class probabilities for each region proposal, i.e. the probability that the proposal contains an object of a specific class. The classification is performed by combining the features extracted from the region proposal with the shared weights of the CNN backbone.
b. Bounding Box Regression: together with the class probabilities, the network predicts bounding box adjustments for each region proposal, which refine the position and size of the proposal’s bounding box and align it more accurately with the actual object boundaries.
These two outputs are produced by two sibling layers. The first is a softmax layer with N+1 outputs that predicts the class of the object in the region proposal, where N is the number of class labels and the additional output corresponds to the background. The second is a bounding box regression layer with 4N outputs, i.e. four box parameters for each of the N classes.
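The RoI pooling step from stage 1 can be sketched in NumPy as below. The sketch assumes that the RoI is already given in integer feature-map coordinates with exclusive upper bounds, and that the output grid is 7×7 by default; both are illustrative choices, and real implementations handle coordinate scaling and batching as well.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of an (H, W, C) feature map into a fixed-size grid.

    roi = (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Bin edges that split the region into out_h x out_w roughly equal cells.
    ys = np.linspace(0, region.shape[0], out_h + 1)
    xs = np.linspace(0, region.shape[1], out_w + 1)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            pooled[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return pooled
```

Whatever the size of the proposal, the result has the same shape, which is what lets the following fully connected layers accept it.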
The Fast R-CNN detector uses a multi-task loss function [26]. It combines classification and regression losses: the classification loss measures the difference between the predicted and true class probabilities, and the regression loss measures the difference between the predicted and actual bounding box adjustments:
$$L(p_i, t_i, v_i) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, v_i)$$
Fig. 24. Faster R-CNN architecture idea (a); ROI pooling (b); bounding box regression (c) [26; 50]
where $N_{cls}$ is the number of ROIs used for classification; $N_{reg}$ is the number of ROIs used for bounding box regression; $p_i$ is the predicted probability for the i-th ROI; $p_i^*$ is the binary (1 or 0) ground-truth indicator of the i-th ROI being a foreground or background object; $t_i$ represents the ground-truth bounding box parameters for the i-th ROI; $v_i$ represents the predicted bounding box adjustments for the i-th ROI; $L_{cls}$ is the classification loss function, usually computed using cross-entropy loss; $L_{reg}$ is the regression loss function, usually computed using smooth L1 loss; and $\lambda$ is a balancing parameter controlling the trade-off between the two components of the loss.
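The loss described above can be written as a short NumPy sketch. It assumes a binary cross-entropy classification term and a smooth L1 regression term that is counted only for foreground ROIs. Taking the number of ROIs as the classification normalizer and the number of foreground ROIs as the default regression normalizer is an assumption of this sketch, as are all names.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)


def multitask_loss(p, p_star, t, v, lam=1.0, n_cls=None, n_reg=None, eps=1e-12):
    """p: (N,) predicted object probability; p_star: (N,) 0/1 labels;
    t: (N, 4) ground-truth box parameters; v: (N, 4) predicted adjustments."""
    p = np.clip(p, eps, 1.0 - eps)
    l_cls = -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))
    l_reg = smooth_l1(t - v).sum(axis=1)
    n_cls = len(p) if n_cls is None else n_cls
    # Assumption: by default normalize the regression term by the foreground count.
    n_reg = max(float(p_star.sum()), 1.0) if n_reg is None else n_reg
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```

Because the regression term is multiplied by the ground-truth indicator, background ROIs contribute only to the classification part.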
After predicting class probabilities and bounding box adjustments, the final detection results are refined using a post-processing procedure. In this step, non-maximum suppression (NMS) is used to reduce redundant detections while keeping the most confident and non-overlapping ones.
Several additional notes on the Faster R-CNN process can be made [50]. For the Fast R-CNN object detection network, two convolutional models are commonly used: the Zeiler and Fergus model (ZF) [54] with 5 shareable convolutional layers, or the Simonyan and Zisserman model (VGG-16) [48] with 13 shareable convolutional layers. With the sliding window approach, the RPN operates on an n×n spatial window of the input convolutional feature map, with each sliding window being mapped to a lower-dimensional feature: 256-d for ZF and 512-d for VGG-16. This mapping is followed by a Rectified Linear Unit (ReLU) activation. The sliding window architecture is effectively realized using an n×n convolutional layer followed by two 1×1 convolutional layers for box regression and box classification. If, for example, n = 3 for the sliding window, this leads to a large effective receptive field on the input image: 171 pixels for ZF and 228 pixels for VGG-16. For each window position, K region proposals are generated, each defined by an anchor box that is set by its scale and aspect ratio. Varying these two parameters produces a set of anchor boxes, usually K = 9, allowing the model to consider various object sizes and shapes. These anchor variations allow the model to handle objects of different scales while sharing features between the RPN and Fast R-CNN.
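A set of K = 9 anchors per window position can be generated as in the sketch below. The three scales (128, 256, 512 pixels), the three aspect ratios (1:2, 1:1, 2:1) and the feature stride of 16 follow the defaults reported in the Faster R-CNN paper [26]; they are assumptions here and not values taken from the experiment.

```python
import numpy as np


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """K = len(scales) * len(ratios) anchors centered at the origin, [x1, y1, x2, y2].

    ratio is height / width; every anchor of a given scale s has area s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


def grid_anchors(feat_h, feat_w, stride=16, **kwargs):
    """Place the base anchors at the center of every feature-map cell."""
    base = base_anchors(**kwargs)                      # (K, 4)
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                       # (H, W) each
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centers + base[None]).reshape(-1, 4)       # (H * W * K, 4)
```

For a feature map of H×W cells this yields H·W·K anchors, which is why the later sampling and NMS steps are needed.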
For each generated region proposal, a feature vector with a length of 256 (for the ZF net) or 512 (for the VGG-16 net) is extracted from the sliding window and then fed into two sibling fully-connected (FC) layers. The box-classification FC layer (cls) is a binary classifier that assigns an objectness score to each region proposal and determines whether the proposal contains an object or is part of the background. This layer produces two outputs: one for classifying the region as background and another for classifying it as an object. The objectness score assigned to each anchor helps to generate the classification label. The box-regression FC layer (reg) predicts adjustments for the bounding box of the proposed region and returns a 4-D vector that defines the bounding box of the region proposal.
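The two sibling layers can be sketched as per-position linear maps, which is what a 1×1 convolution computes. The sketch assumes the intermediate feature map (256-d or 512-d per position) is already available and uses plain NumPy instead of a deep learning framework; the weight shapes and names are illustrative.

```python
import numpy as np


def rpn_head(features, w_cls, b_cls, w_reg, b_reg):
    """features: (H, W, D) intermediate features from the sliding window.

    w_cls: (D, 2K), w_reg: (D, 4K) for K anchors per position.
    Returns objectness (H, W, K) and box adjustments (H, W, K, 4).
    """
    h, w, _ = features.shape
    k = w_reg.shape[1] // 4
    # A 1x1 convolution applies the same linear map at every position.
    logits = (features @ w_cls + b_cls).reshape(h, w, k, 2)
    # Softmax over the two outputs (background, object) of each anchor.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    deltas = (features @ w_reg + b_reg).reshape(h, w, k, 4)
    return probs[..., 1], deltas
```

With K = 9 the classification branch has 18 outputs per position and the regression branch has 36.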
The Region Proposal Network (RPN) is trained end-to-end using backpropagation and stochastic gradient descent (SGD), i.e. the entire network, including the newly added RPN layers and the shared convolutional layers, is optimized together to minimize the loss function. The training strategy is “image-centric” sampling [50], in which each mini-batch is derived from a single image. This image contains both positive (object) and negative (background) example anchors. Instead of optimizing the loss function over all anchors, which would bias the training towards the far more numerous negative samples, the network randomly picks 256 anchors from the image to calculate the loss for the mini-batch. The sampled positive and negative anchors are balanced at a ratio of up to 1:1. If an image has fewer than 128 positive samples, additional negative samples are added to keep the mini-batch size at 256.
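The sampling rule can be sketched as follows. The label convention (1 for positive, 0 for negative, -1 for anchors that are ignored) and the function name are assumptions of this sketch.

```python
import numpy as np


def sample_anchors(labels, batch_size=256, positive_fraction=0.5, rng=None):
    """Pick up to batch_size anchors with at most positive_fraction positives.

    labels: (N,) array with 1 = positive, 0 = negative, -1 = ignored.
    Returns the indices of the sampled positive and negative anchors.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(batch_size * positive_fraction))
    # Missing positives are replaced by extra negatives to fill the mini-batch.
    n_neg = min(len(neg), batch_size - n_pos)
    pos_idx = rng.permutation(pos)[:n_pos]
    neg_idx = rng.permutation(neg)[:n_neg]
    return pos_idx, neg_idx
```

For an image with only 20 positive anchors this returns 20 positives and 236 negatives, so the ratio is "up to" 1:1 rather than exactly 1:1.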
Regarding layer initialization, the new layers added to the architecture, i.e. the layers specific to the RPN, are initialized by drawing weights from a Gaussian distribution with a mean of zero and a standard deviation of 0.01. The existing shared convolutional layers are initialized with weights pretrained on the ImageNet classification task, according to standard practice.
YOLO detector. You Only Look Once (YOLO) [52] is an architecture that uses an end-to-end neural network to predict bounding boxes and class probabilities all at once [55]. The main difference from other object detection algorithms is that those algorithms repurposed classifiers to perform detection. The YOLO model is based on a fundamentally different object detection approach and thus performs extremely fast, significantly outperforming other real-time object detection algorithms in speed. This detector makes all of its predictions with the help of a single fully connected layer, in contrast to other networks, like Faster R-CNN, which detect possible regions of interest with the help of the RPN and then perform recognition on those regions separately. In other words, for the same image YOLO makes a single pass, whereas RPN-based networks perform multiple passes.
The YOLO architecture [52; 55] is depicted in Fig. 25: the algorithm receives an image as input and then detects objects in this image with the help of a simple deep convolutional neural network as its backbone. In this model, the first 20 convolutional layers are pre-trained on ImageNet by plugging in a temporary average pooling layer and a fully connected layer. This pre-trained model is then converted to perform detection, because earlier studies found that adding convolutional and connected layers to a pre-trained network improves performance. The model’s final fully connected layer predicts both class probabilities and bounding box coordinates.
Fig. 25. YOLO original architecture (source: original paper [52; 55])
During processing, YOLO divides an input image into an S×S grid. If the center of an object falls within a grid cell, that grid cell is responsible for detecting the object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores indicate how confident the model is that a box contains an object and how accurate the model thinks the predicted box is. Since the algorithm predicts multiple bounding boxes per grid cell, while at the training stage only one bounding box predictor should be responsible for each object, YOLO assigns one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization among the bounding box predictors: each predictor gets better at forecasting certain sizes, aspect ratios, or classes of objects, improving the overall recall score. YOLO models also use one key technique called Non-Maximum Suppression (NMS), a post-processing step that improves the accuracy and efficiency of object detection. In object detection, multiple bounding boxes are commonly generated for a single object in an image. As such bounding boxes can be located at different positions or may overlap while still representing the same object, NMS is used to identify and remove redundant or incorrect bounding boxes and to output a single bounding box for each object in the image [55].
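The grid assignment described above can be sketched in a few lines. The defaults S = 7, B = 2 and C = 20 classes are the values from the original YOLO paper [52] and are used here only as an assumption; the function names and the pixel box layout are illustrative.

```python
def responsible_cell(box, image_w, image_h, S=7):
    """box = (x1, y1, x2, y2) in pixels.

    Returns the (row, col) of the grid cell that contains the box center and
    the box encoded as (x, y, w, h): x, y are offsets inside that cell in
    [0, 1), and w, h are relative to the whole image.
    """
    cx = (box[0] + box[2]) / 2 / image_w
    cy = (box[1] + box[3]) / 2 / image_h
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    x = cx * S - col
    y = cy * S - row
    w = (box[2] - box[0]) / image_w
    h = (box[3] - box[1]) / image_h
    return (row, col), (x, y, w, h)


def output_shape(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return (S, S, B * 5 + C)
```

With these defaults the network output is a 7×7×30 tensor, and only the cell returned by `responsible_cell` is trained to predict the given object.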
Since the initial release of YOLO in 2015, it has received several modifications and improvements, and thus new versions have appeared. For our research the YOLO v5 version was chosen, as it is a relatively recent modification and is easy to install for instant usage. The v5 version [56] was introduced in 2020 by Ultralytics. Unlike the original model, the v5 version uses a CSPDarknet53-based backbone (as listed in Table 5), which allowed the new model to increase accuracy and generalization to a wider range of object categories. The v5
version also uses a method for generating the anchor boxes called “dynamic anchor boxes”: a clustering algorithm groups the ground truth bounding boxes into clusters, and the centroids of the clusters are then used as the anchor boxes. Thanks to this, the anchor boxes are more closely aligned with the size and shape of the detected objects. One more idea used in YOLO v5 is Spatial Pyramid Pooling (SPP), a type of pooling layer that pools the feature maps at several window sizes. It allows the model to see objects at multiple scales and thus improves the detection performance on small objects. In addition, the v5 model uses a variant of the IOU loss function called “CIOU Loss”, which, according to [55], is designed to improve the model's performance on imbalanced datasets.
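The clustering idea behind the anchor boxes can be sketched with a plain k-means over box widths and heights. This is a simplified illustration and not the YOLO v5 implementation: it assumes that similarity is measured by the IOU of boxes aligned at a common corner, and all names are hypothetical.

```python
import numpy as np


def wh_iou(wh, centroids):
    """IoU of boxes that share a corner. wh: (N, 2), centroids: (K, 2) -> (N, K)."""
    inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0])
             * np.minimum(wh[:, None, 1], centroids[None, :, 1]))
    union = (wh[:, 0:1] * wh[:, 1:2]
             + (centroids[:, 0] * centroids[:, 1])[None]
             - inter)
    return inter / union


def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs; centroids become the anchors."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to the centroid it overlaps most.
        assign = np.argmax(wh_iou(wh, centroids), axis=1)
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Sort by area so that the smallest anchor comes first.
    return centroids[np.argsort(centroids.prod(axis=1))]
```

The input would be the widths and heights of all ground-truth boxes in the training set, so the resulting anchors follow the object sizes that actually occur in the data.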
Training and experiment. The whole experiment was performed locally on a laptop with an Intel i9-13980HX CPU, an NVidia 4090 16GB laptop GPU and 64GB RAM, using CUDA 11.2. Due to hardware limitations, training was limited to 5000 training steps for a general comparison. The test runs for each model were performed on the same machine. The dataset used for training consisted of 562 images (selected video frames). The test set consisted of 63 images. The purpose of the test is not only to evaluate the accuracy of the models trained on this dataset, but also to see how fast the models train given their different internal architectures. As mentioned previously, we did not measure FPS because each video frame
is additionally preprocessed to improve its color and remove noise, and thus this
processing takes some additional time.
Also, because of hardware limitations (GPU memory size), each model was trained with the maximum possible image batch size. For example, we did not consider EfficientDet D3 or higher versions, since on the current hardware it was only possible to train them with a batch of 4, which was considered insufficient for good model training. Also, as the training results of CenterNet MobileNetV2 FPN 512×512 were very poor despite 5.5 h of training, it was decided to exclude it from the comparison.
Note on the detection speed benchmark: for the first image, the detection time was always much longer than for the other images in the test set, because the model is loaded by the program for the first time. We therefore excluded the detection time of the first image from the average detection speed calculation, as this delay occurs for the very first image only and is insignificant for the further detection period.
Fig. 26 shows some examples of the output of the neural network object detector.
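The averaging rule from the benchmark note can be sketched as follows. Here `detect` stands for any callable that runs a detector on one image; it is a placeholder and not a function of the frameworks used in the experiment.

```python
import time


def average_detection_ms(detect, images, skip_first=1):
    """Time detect(image) for every image and average all but the first call(s).

    Returns (average time in ms without the warm-up calls, list of all times in ms).
    """
    if len(images) <= skip_first:
        raise ValueError("need more images than warm-up calls")
    times_ms = []
    for image in images:
        start = time.perf_counter()
        detect(image)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    # The first call includes one-time model loading, so it is left out.
    steady = times_ms[skip_first:]
    return sum(steady) / len(steady), times_ms
```

Returning the full list of timings as well makes it possible to report the warm-up delay separately if needed.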
CONCLUSIONS
The experiment results are shown in Table 5. According to them, models such as EfficientDet, the SSD nets with ResNet50 V1 and ResNet101 V1 backbones, and the Faster R-CNNs are not very suitable for training on ordinary hardware, as they operate on large amounts of data during training and thus require a lot of memory at larger batch sizes, which can become problematic. Using a smaller batch size can prevent a model from training well enough, even though its training time was sometimes shorter than that of other models. Extending the training time is also not necessarily a good option, as seen from the EfficientDet 10K-step training. As for the other mentioned models (SSDs and Faster R-CNNs), with the same training amount of 5K steps these models also showed lower accuracy.
The detection speed benchmark results also show that only CenterNet Resnet101 V1 FPN 512×512, SSD MobileNet V1 FPN and YOLO v5 manage to fit within the given detection time threshold of about 110 ms. The other SSD nets showed longer detection times with somewhat lower accuracy than SSD MobileNet V1 FPN.
Thus, for our further research we decided to consider SSD MobileNet V1 FPN and YOLO as the most suitable for the detection of relatively simple objects with minimal visual detail. This result might be useful for anyone trying to use any of the studied models for similar tasks. Of all the tested models, only these can be trained with a minimal significant amount of data and still perform accurately and fast enough even on less capable hardware than that used for our experiment. Thus we will consider them as the primary neural object detectors for fish detection and tracking. We still might consider CenterNet with HourGlass104 and Resnet101 V1 FPN for some other tasks, as they also maintain a relatively good balance of detection speed, training requirements and resulting accuracy.
Fig. 26. NN object detection results
Table 5. Model training and evaluation results
| Architecture | Backbone | Network input size | Training batch size | Training time (approx.) | AP (%) | AP50 IOU (%) | AP75 IOU (%) | AP(M) (%) | AP(L) (%) | Avg detection speed (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| CenterNet | HourGlass104 | 512×512 | 10 | 5.5 h | 79.16 | 99.23 | 93.64 | 62.02 | 80.9 | 183 |
| CenterNet | Resnet101 V1 FPN | 512×512 | 20 | 5 h | 74.59 | 98.19 | 91.05 | 58.01 | 76.31 | 83 |
| EfficientDet D2 | EfficientNet | 896×896 | 6 | 3 h | 55.83 | 92.31 | 62.23 | 32.65 | 58.32 | 233 |
| EfficientDet D2 | EfficientNet | 896×896 | 6 (10K steps) | 3 h | 65.47 | 93.9 | 82.41 | 45.31 | 67.44 | 263 |
| SSD | MobileNet V1 FPN | 640×640 | 32 | 10 h | 87.42 | 99.03 | 98.96 | 75.58 | 88.8 | 112 |
| SSD | MobileNet V2 FPNLite | 640×640 | 28 | 4.5 h | 73.43 | 97.5 | 89.95 | 55.31 | 75.22 | 116 |
| SSD | ResNet50 V1 FPN | 1024×1024 | 8 | 7.5 h | 73.53 | 96.6 | 88.62 | 47.13 | 76.17 | 141 |
| SSD | ResNet101 V1 FPN | 1024×1024 | 5 | 3.5 h | 23.29 | 44.47 | 24.45 | 3.04 | 25.46 | 175 |
| Faster R-CNN | ResNet50 V1 | 800×1333 | 6 | 2 h | 71.41 | 98.5 | 88.89 | 48.0 | 73.67 | 143 |
| Faster R-CNN | ResNet101 V1 | 1024×1024 | 5 | 3 h | 70.85 | 97.65 | 88.72 | 51.48 | 72.85 | 241 |
| | | | | | AP50-95 | | | Precision | Recall | |
| Yolo V5 | CSPDarknet53 | 640×640 | 80 | 1.5 h | 98.5 | 98.9 | - | 98.1 | 97.7 | 8 |
Acknowledgements. Special thanks for the research assistance and for the provided test videos and images of lab animals go to the Faculty of Biology of Odesa I.I. Mechnikov National University.
REFERENCES
1. A.R. Smith, “Color Gamut Transform Pairs,” in SIGGRAPH '78: Proceedings of the
5th annual conference on Computer graphics and interactive techniques, pp. 12–
19,1978. doi: 10.1145/800248.807361
2. M.A. Shvandt, V.V. Moroz, “Overview Of The Detection And Tracking Methods Of
The Lab Animals”, System Research & Information Technologies, no. 1, 2022, pp.
124–148. doi: 10.20535/SRIT.2308-8893.2022.1.10
3. V.V. Moroz, M.A. Shvandt, “Study of movement and behavior of laboratory animals by methods of object detection and tracking”, Herald of the National Technical University ‘KhPI’, Series of ‘Informatics and Modeling’, Kharkiv: NTU ‘KhPI’, vol. 13, no. 1338, pp. 93–103, 2019. doi: 10.20998/2411-0558.2019.13.09
4. TensorFlow 2 Detection Model Zoo. Available: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
5. T.-Y. Lin et al., Microsoft COCO: Common Objects in Context. 2014, 15 p. doi: 10.48550/ARXIV.1405.0312. Available: https://arxiv.org/abs/1405.0312
6. L. Wood, F. Chollet, Efficient Graph-Friendly COCO Metric Computation for Train-Time Model Evaluation. 2022, 7 p. doi: 10.48550/ARXIV.2207.12120. Available: https://arxiv.org/abs/2207.12120
7. K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, CenterNet: Keypoint Triplets for
Object Detection. 2019, 10 p. doi: 10.48550/ARXIV.1904.08189. Available:
https://arxiv.org/abs/1904.08189
8. H. Law, J. Deng, CornerNet: Detecting Objects as Paired Keypoints. 2018, 14 p.
doi: 10.48550/ARXIV.1808.01244. Available: https://arxiv.org/abs/1808.01244
9. X. Lu, B. Li, Y. Yue, Q. Li, J. Yan, Grid R-CNN. 2018, 9 p. doi:
10.48550/ARXIV.1811.12030. Available: https://arxiv.org/abs/1811.12030
10. X. Zhou, D. Wang, P. Krähenbühl, Objects as Points. 2019, 12 p. doi: 10.48550/ARXIV.1904.07850. Available: https://arxiv.org/abs/1904.07850
11. S. Trivedi, CenterNet: Objects as Points - A Comprehensive Guide. 2020. Available: https://medium.com/visionwizard/centernet-objects-as-points-a-comprehensive-guide-2ed9993c48bc
12. L. Huang, Y. Yang, Y. Deng, Y. Yu, DenseBox: Unifying Landmark Localization with End to End Object Detection. 2015, 13 p. doi: 10.48550/ARXIV.1509.04874. Available: https://arxiv.org/abs/1509.04874
13. Z. Tian, C. Shen, H. Chen, T. He, FCOS: Fully Convolutional One-Stage Object Detection. 2019, 13 p. doi: 10.48550/ARXIV.1904.01355. Available: https://arxiv.org/abs/1904.01355
14. A. Newell, K. Yang, J. Deng, Stacked Hourglass Networks for Human Pose Estimation.
2016, 17 p. doi: 10.48550/ARXIV.1603.06937. Available: https://arxiv.org/abs/1603.06937
15. K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition.
2015, 12 p. doi: 10.48550/ARXIV.1512.03385. Available: https://arxiv.org/abs/1512.03385
16. A.G. Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. 2017, 9 p. doi: 10.48550/ARXIV.1704.04861. Available: https://arxiv.org/abs/1704.04861
17. D. Wang, E. Shelhamer, T. Darrell, Deep Layer Aggregation. 2017, 10 p. doi: 10.48550/ARXIV.1707.06484. Available: https://arxiv.org/abs/1707.06484
18. T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection. 2017, 10 p. doi: 10.48550/ARXIV.1708.02002. Available: https://arxiv.org/abs/1708.02002
19. S. Trivedi, Understanding Focal Loss-A Quick Read. 2020. Available: https://medium.com/visionwizard/understanding-focal-loss-a-quick-read-b914422913e7
20. S. Bangar, Resnet Architecture Explained. 2022. Available: https://medium.com/@siddheshb008/resnet-architecture-explained-47309ea9283d
21. P. Ruiz, Understanding and visualizing ResNets. 2018. Available: https://towardsdatascience.com/understanding-and-visualizing-resnets-442284831be8
22. T.-Yi Lin et al., Feature Pyramid Networks for Object Detection. 2016, 10 p. doi:
10.48550/ARXIV.1612.03144. Available: https://arxiv.org/abs/1612.03144
23. J. Hui, Understanding Feature Pyramid Networks for object detection (FPN). 2018. Available: https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c
24. S.-H. Tsang, Review: FPN - Feature Pyramid Network (Object Detection). 2019. Available: https://towardsdatascience.com/review-fpn-feature-pyramid-network-object-detection-262fc7482610
25. S. Tanwar, FPN (feature pyramid networks). 2020. Available: https://medium.com/analytics-vidhya/fpn-feature-pyramid-networks-77d8be41817c
26. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2015, 14 p. doi: 10.48550/ARXIV.1506.01497. Available: https://arxiv.org/abs/1506.01497
27. W. Liu et al., SSD: Single Shot MultiBox Detector. 2015, 17 p. doi: 10.48550/ARXIV.1512.02325. Available: https://arxiv.org/abs/1512.02325
28. S.-H. Tsang, Review: MobileNetV1 - Depthwise Separable Convolution (Light Weight Model). 2018. Available: https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69
29. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted
Residuals and Linear Bottlenecks. 2018, 14 p. doi: 10.48550/ARXIV.1801.04381.
Available: https://arxiv.org/abs/1801.04381v4
30. S.-H. Tsang, Review: MobileNetV2 - Light Weight Model (Image Classification). 2019. Available: https://towardsdatascience.com/review-mobilenetv2-light-weight-model-image-classification-8febb490e61c
31. A. Newell, K. Yang, J. Deng, Stacked Hourglass Networks for Human Pose Estimation. 2016, 17 p. doi: 10.48550/ARXIV.1603.06937. Available: https://arxiv.org/abs/1603.06937
32. E. Calleris, The Hourglass Network. 2022. Available: https://medium.com/@calleris.enrico/hourglass-network-6e74cdb9ce2f
33. S. Li, Simple Introduction about Hourglass-like Model. 2017. Available: https://medium.com/@sunnerli/simple-introduction-about-hourglass-like-model-11ee7c30138
34. J. Long, E. Shelhamer and T. Darrell, Fully Convolutional Networks for Semantic
Segmentation. 2014, 10 p. doi: 10.48550/ARXIV.1411.4038. Available:
https://arxiv.org/abs/1411.4038
35. A. Krizhevsky, I. Sutskever, G.E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - NIPS’12, Curran Associates Inc., Red Hook, NY, USA, 2012, vol. 1, pp. 1097–1105.
36. O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical
Image Segmentation. 2015, 8 p. doi: 10.48550/ARXIV.1505.04597. Available:
https://arxiv.org/abs/1505.04597
37. S. Minaee et al., Image Segmentation Using Deep Learning: A Survey. 2020, 22 p.
doi: 10.48550/ARXIV.2001.05566. Available: https://arxiv.org/abs/2001.05566
38. F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016, 11 p. doi: 10.48550/ARXIV.1606.04797. Available: https://arxiv.org/abs/1606.04797
39. J.T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for Simplicity:
The All Convolutional Net. 2015, 14 p. doi: 10.48550/ARXIV.1412.6806. Available:
https://arxiv.org/abs/1412.6806
40. D. Oñoro-Rubio, M. Niepert, Contextual Hourglass Networks for Segmentation and
Density Estimation. 2018, 3 p. doi: 10.48550/ARXIV.1806.04009. Available:
https://arxiv.org/abs/1806.04009
41. R. Sharma, EfficientDet: Scalable and Efficient Object Detection. 2021. Available: https://medium.com/analytics-vidhya/efficientdet-scalable-and-efficient-object-detection-384a5df9011a
42. M. Tan, R. Pang, Q.V. Le, EfficientDet: Scalable and Efficient Object Detection. 2019,
10 p. doi: 10.48550/ARXIV.1911.09070. Available: https://arxiv.org/abs/1911.09070
43. M. Tan, Q.V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks. 2020, 11 p. doi: 10.48550/ARXIV.1905.11946. Available:
https://arxiv.org/abs/1905.11946
44. J. Solawetz, A Thorough Breakdown of EfficientDet for Object Detection. 2020. Available: https://towardsdatascience.com/a-thorough-breakdown-of-efficientdet-for-object-detection-dc6a15788b73
45. S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path Aggregation Network for Instance Segmentation. 2018, 11 p. doi: 10.48550/ARXIV.1803.01534. Available: https://arxiv.org/abs/1803.01534
46. C. Peng et al., MegDet: A Large Mini-Batch Object Detector. 2017, 9 p. doi: 10.48550/ARXIV.1711.07240. Available: https://arxiv.org/abs/1711.07240
47. J. Hui, SSD object detection: Single Shot MultiBox Detector for real-time processing. 2018. Available: https://jonathan-hui.medium.com/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06
48. K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014, 14 p. doi: 10.48550/ARXIV.1409.1556. Available: https://arxiv.org/abs/1409.1556
49. J. Boschman, VGG16 (2014) – one minute summary. 2021. Available: https://medium.com/one-minute-machine-learning/very-deep-convolutional-networks-for-large-scale-image-recognition-2014-one-minute-summary-44a8f04586ab
50. Faster R-CNN – ML. Available: https://www.geeksforgeeks.org/faster-r-cnn-ml/
51. A. Khazri, Faster RCNN Object detection. 2019. Available: https://towardsdatascience.com/faster-rcnn-object-detection-f865e5ed7fc4
52. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection. 2016, 10 p. doi: 10.48550/ARXIV.1506.02640. Available: https://arxiv.org/abs/1506.02640
53. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, “Selective
Search for Object Recognition”, in International Journal of Computer Vision, 2013,
14 p. doi: 10.1007/s11263-013-0620-5
54. M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks.
2013, 11 p. doi: 10.48550/ARXIV.1311.2901. Available: https://arxiv.org/abs/1311.2901
55. R. Kundu, YOLO: Algorithm for Object Detection Explained [+Examples]. 2023.
Available: https://www.v7labs.com/blog/yolo-object-detection
56. S.-H. Tsang, Brief Review: YOLOv5 for Object Detection. 2023. Available: https://sh-tsang.medium.com/brief-review-yolov5-for-object-detection-84cc6c6a0e3a
Received 25.12.2023
INFORMATION ON THE ARTICLE
Maksym A. Shvandt, ORCID: 0000-0002-4580-3961, Odesa I.I. Mechnikov National University, Ukraine, e-mail: maxim.shvandt@gmail.com
Volodymyr V. Moroz, ORCID: 0000-0002-3240-4590, Odesa I.I. Mechnikov National University, Ukraine, e-mail: v.moroz@onu.edu.ua
OVERVIEW OF NEURAL NETWORK OBJECT DETECTION METHODS AND MODELS ON THE EXAMPLE OF THEIR USE FOR LAB ANIMAL OBSERVATION / M.A. Shvandt, V.V. Moroz
Abstract. A brief overview of the most common basic neural network models for object detection is given. The need for automating observation and surveillance processes is constantly growing. One of the key tasks of such processes is the detection of an object of interest for its further analysis. Many basic object detection algorithms and approaches have been proposed, but most of them usually have some limitations regarding their field of application. Mostly these limitations are caused by the nature of the observed environment or by the fact that the detection approaches depend on particular object characteristics, such as color only or some basic shapes. To solve these problems, a generally new approach to object detection using neural networks was developed. The basics and main aspects of the most common neural network models for object detection are presented. The experiment demonstrated the features, advantages, and disadvantages of the studied methods when applied to the detection of laboratory animals during the study of their behavior. Considering this, conclusions are drawn and recommendations for their use are given.
Keywords: object detection, neural network, neural layer, architecture, model, optimization, estimation, prediction, video, image, frame, background, foreground, experiment, comparison.
|
| spelling | journaliasakpiua-article-3514222026-02-02T20:49:24Z Overview of neural network object detection methods & models on the example of their use for lab animal observation Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами Shvandt, Maksym Moroz, Volodymyr object detection neural network neural layer architecture model optimization estimation prediction video image frame background foreground experiment comparison детектування об’єктів нейронна мережа нейронний шар архітектура модель оптимізація оцінка прогноз відео зображення кадр задній фон передній фон експеримент порівняння This article provides a brief overview of a set of the most common basic object detection neural network models. Today, the need for automating surveillance and observation processes remains a growing trend. Moreover, one of the key tasks of such processes is usually the detection of an object of interest for further analysis. Previously, many basic object detection algorithms and approaches have been proposed; however, most of them typically have limitations in terms of their applicability. In most cases, these limitations arise due to the nature of the observed environment or because the detection approaches rely on specific object characteristics, such as color or basic shapes only. To address these problems, a new approach for object detection has been developed using neural networks. This paper presents the basis and central aspects of the most common neural network object detection models. The experiment has demonstrated the features, advantages, and disadvantages of the studied methods in the application case of lab animal detection during their behavioral study. Considering this, conclusions and recommendations for their usage cases were made. Наведено стислий огляд найпоширеніших базових моделей нейронних мереж для виявлення об’єктів. 
Потреба в автоматизації процесів спостереження та нагляду постійно зростає. Одним із ключових завдань таких процесів є виявлення об’єкта, що цікавить, для подальшого його аналізу. Було запропоновано багато основних алгоритмів і підходів виявлення об’єктів, але більшість із них, зазвичай, мають деякі обмеження щодо області застосування. Здебільшого ці обмеження зумовлені характером спостережуваного середовища або через те, що підходи до виявлення залежать від окремих характеристик об’єкта, як-от лише колір або деякі основні форми. Для вирішення цих проблем був розроблений загалом новий підхід до виявлення об’єктів із використанням нейронних мереж. Подано основи та основні аспекти найбільш поширених моделей нейронних мереж для виявлення об’єктів. Експеримент продемонстрував особливості, переваги та недоліки досліджуваних методів при застосуванні для виявлення лабораторних тварин під час вивчення їх поведінки. З огляду на це зроблено висновки та надано рекомендації щодо їх використання. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2025-12-29 Article Article Peer-reviewed Article application/pdf https://journal.iasa.kpi.ua/article/view/351422 10.20535/SRIT.2308-8893.2025.4.05 System research and information technologies; No. 4 (2025); 71-103 Системные исследования и информационные технологии; № 4 (2025); 71-103 Системні дослідження та інформаційні технології; № 4 (2025); 71-103 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/351422/338441 |
| spellingShingle | детектування об’єктів нейронна мережа нейронний шар архітектура модель оптимізація оцінка прогноз відео зображення кадр задній фон передній фон експеримент порівняння Shvandt, Maksym Moroz, Volodymyr Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title | Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title_alt | Overview of neural network object detection methods & models on the example of their use for lab animal observation |
| title_full | Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title_fullStr | Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title_full_unstemmed | Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title_short | Огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| title_sort | огляд методів і моделей детектування об’єктів на базі нейронних мереж на прикладі їх застосування для спостереження за лабораторними тваринами |
| topic | детектування об’єктів нейронна мережа нейронний шар архітектура модель оптимізація оцінка прогноз відео зображення кадр задній фон передній фон експеримент порівняння |
| topic_facet | object detection neural network neural layer architecture model optimization estimation prediction video image frame background foreground experiment comparison детектування об’єктів нейронна мережа нейронний шар архітектура модель оптимізація оцінка прогноз відео зображення кадр задній фон передній фон експеримент порівняння |
| url | https://journal.iasa.kpi.ua/article/view/351422 |
| work_keys_str_mv | AT shvandtmaksym overviewofneuralnetworkobjectdetectionmethodsampmodelsontheexampleoftheiruseforlabanimalobservation AT morozvolodymyr overviewofneuralnetworkobjectdetectionmethodsampmodelsontheexampleoftheiruseforlabanimalobservation AT shvandtmaksym oglâdmetodívímodelejdetektuvannâobêktívnabazínejronnihmerežnaprikladííhzastosuvannâdlâsposterežennâzalaboratornimitvarinami AT morozvolodymyr oglâdmetodívímodelejdetektuvannâobêktívnabazínejronnihmerežnaprikladííhzastosuvannâdlâsposterežennâzalaboratornimitvarinami |