Enhancing ball detection in football videos using attention mechanisms in FPN-based CNNs
While deep learning models have significantly advanced player detection in sports analytics, accurately identifying the football remains a persistent challenge due to its small size, rapid movement, frequent occlusions, and visual similarity to other elements such as player socks, logos, and field...
Saved in:
| Date: | 2025 |
|---|---|
| Main authors: | Ivasenko, I.B.; Bishyr, S.S. |
| Format: | Article |
| Language: | English |
| Published: | PROBLEMS IN PROGRAMMING, 2025 |
| Journal title: | Problems in programming |
| Subjects: | ball detection; deep learning; football video analysis; object detection; attention mechanisms; Feature Pyramid Network |
| DOI: | 10.15407/pp2025.02.054 |
| Pages: | 54-62 |
| Online access: | https://pp.isofts.kiev.ua/index.php/ojs1/article/view/837 |
© I.B. Ivasenko, S.S. Bishyr, 2025
ISSN 1727-4907. Problems in Programming. 2025. No 2
UDC 004.932.2:796.332    https://doi.org/10.15407/pp2025.02.054
І.Б. Івасенко, С.С. Бішир

ПІДВИЩЕННЯ ТОЧНОСТІ ВИЯВЛЕННЯ М'ЯЧА У ВІДЕО ФУТБОЛЬНИХ МАТЧІВ ЗА ДОПОМОГОЮ МЕХАНІЗМІВ УВАГИ В CNN-МОДЕЛЯХ НА ОСНОВІ FPN

Попри значний прогрес у виявленні гравців завдяки моделям глибокого навчання в спортивній аналітиці, точне розпізнавання футбольного м'яча залишається складною задачею через його малий розмір, швидкий рух, часті оклюзії та візуальну подібність до інших елементів, таких як гетри гравців, логотипи та розмітка поля. Ці обмеження значно знижують ефективність автоматизованих систем для комплексного аналізу футбольних матчів, особливо в таких задачах, як розпізнавання тактичних подій, класифікація ударів і прогнозування ігрових станів. У цій роботі запропоновано метод підвищення точності виявлення м'яча у відео футбольних матчів шляхом удосконалення наявної архітектури на основі Feature Pyramid Networks (FPN). Базова модель на основі FPN, хоча й ефективна для виявлення гравців, демонструє обмежену продуктивність у розпізнаванні дрібних об'єктів, таких як м'яч. Для вирішення цієї проблеми ми інтегрували легкі механізми уваги, які дозволяють моделі краще зосереджуватись на релевантних просторових та семантичних ознаках. Зокрема, ми впроваджуємо шари Squeeze-and-Excitation (SE) у базову мережу для переналаштування ознак на рівні каналів, а також додаємо модуль CBAM (Convolutional Block Attention Module) до голови виявлення м'яча для уточнення просторової та канальної уваги. Ці модифікації покликані покращити здатність мережі відрізняти м'яч від візуально схожих об'єктів і перевантаженого фону. Наші експерименти, проведені на наборах даних ISSIA-CNR та Soccer Player Detection, демонструють, що запропонована модель з увагою досягає кращої точності класифікації м'яча порівняно з базовим підходом, без погіршення точності виявлення гравців. Отримані результати підтверджують ефективність легких механізмів уваги в задачах виявлення дрібних об'єктів та відкривають перспективи для створення більш надійних і реалістичних систем аналізу футбольних відео у реальному часі.

Ключові слова: виявлення м'яча, глибоке навчання, аналіз футбольного відео, виявлення об'єктів, механізми уваги, Feature Pyramid Network
I.B. Ivasenko, S.S. Bishyr

ENHANCING BALL DETECTION IN FOOTBALL VIDEOS USING ATTENTION MECHANISMS IN FPN-BASED CNNS

While deep learning models have significantly advanced player detection in sports analytics, accurately identifying the football remains a persistent challenge due to its small size, rapid movement, frequent occlusions, and visual similarity to other elements such as player socks, logos, and field markings. This limitation significantly reduces the effectiveness of automated systems in comprehensively analyzing football matches, particularly in applications such as tactical event recognition, shot classification, and game state prediction. In this paper, we propose a method to improve ball detection accuracy in football videos by enhancing an existing architecture based on Feature Pyramid Networks (FPN). The original FPN-based model, although efficient for detecting large-scale players, shows limited performance in detecting small objects such as the ball. To address this, we integrate lightweight attention mechanisms to help the model focus on more relevant spatial and semantic features. Specifically, we introduce Squeeze-and-Excitation (SE) layers into the backbone of the network to perform channel-wise feature recalibration and embed a Convolutional Block Attention Module (CBAM) into the ball detection head to refine both spatial and channel-level attention. These modifications are designed to enhance the network's ability to distinguish the ball from cluttered backgrounds and visually similar objects. Our experiments, conducted on the ISSIA-CNR and Soccer Player Detection datasets, demonstrate that the proposed attention-augmented model achieves improved ball classification accuracy compared to the baseline, with no degradation in player detection performance. These results validate the utility of lightweight attention mechanisms in the context of small object detection and provide a promising direction for more robust and real-time football video analysis systems.

Keywords: ball detection, deep learning, football video analysis, object detection, attention mechanisms, Feature Pyramid Network
Introduction

Ball detection plays a crucial part in the automated analysis of football matches, enabling advanced tasks such as event detection, match analysis, and performance assessment [1] [2]. However, accurate ball detection remains challenging because of the ball's small size, fast movement, occlusions, and similar appearance to other elements, such as player socks, goalkeeper gloves, or field lines [3] [4]. While deep learning methods, especially convolutional neural networks (CNNs), have significantly advanced the state of object detection in sports analytics, existing approaches often struggle to reliably identify the ball under diverse match conditions [5] [6].

Feature Pyramid Networks (FPN) are a promising approach to object detection in complex scenes [7]. They allow for effective multi-scale feature extraction by combining low- and high-level features. A recent study proposed an FPN-based approach as an integrated ball and player detector in footage from football matches [3]. The approach demonstrated strong performance in player detection. Nonetheless, the same approach showed comparatively lower accuracy in ball detection because of the ball's small size, high speed, frequent occlusions, and visual similarity to other objects. As a result, even state-of-the-art object detection models, such as YOLO [8] and SSD [9], frequently misidentify or completely miss small, fast-moving targets [10] [11] [12]. This indicates a need for further refinement to increase the effectiveness of detecting small and fast-moving objects.

This paper addresses that specific limitation. We aim to enhance the ball detection performance of an existing FPN-based architecture by integrating lightweight attention mechanisms. Our approach builds on the recent success of applying attention mechanisms to improve small object detection in remote sensing [15], aerial imagery [16], and medical imaging [17]. Additionally, the idea of enhancing FPNs with attention is supported by the Attentional Feature Pyramid Network (AFPN) proposed by Min et al. [18].

The remainder of this paper is organized as follows:
• Section 2 discusses related work on object detection in football and attention mechanisms.
• Section 3 provides the methodology, including the original architecture and our proposed enhancements.
• Section 4 describes the setup and the outcome of the experiments.
• Section 5 presents an analysis of the work.
• Section 6 summarizes the work and outlines future research directions.
Related Work

Ball Detection in Football Analytics. Object detection has become essential to football video analytics, helping recognize players, the ball, and key events such as shots [1] [2]. Traditional computer vision approaches relied on handcrafted features and motion tracking [5] but struggled in scenarios involving occlusion, fast motion, or cluttered backgrounds. With the advent of deep learning, CNN-based methods have achieved better performance in sports analytics tasks.

Recent studies have employed architectures like YOLO [8] and SSD [9] for real-time player and ball detection. However, these models often struggle to detect small objects like the ball, especially in low-resolution frames or when the ball is partially occluded [5] [6]. The FPN-based base model used in this work represents an improvement by leveraging multi-scale feature maps, improving the detection of both large and small objects [3] [19]. Despite this, the detection accuracy for the ball remained lower than for players, motivating further research into specialized enhancements.

Feature Pyramid Networks (FPN). The Feature Pyramid Network (FPN) [7], introduced by Lin et al., is a widely adopted architecture for multi-scale object detection. It enhances a backbone CNN (e.g., ResNet) by creating a top-down pathway and lateral connections that fuse semantically rich features from higher layers with detailed spatial features from earlier layers. FPN models are especially effective at detecting objects at different scales within the same image [10]. They are well suited for complex scenes like football fields, where players and the ball vary in size and appearance. However, even with FPN's multi-scale approach, small objects like the ball can remain hard to detect due to weak spatial cues or low contrast. Some works, such as the Attentional Feature Pyramid Network (AFPN) [18], further enhance FPNs by introducing attention mechanisms to better focus on important features at multiple scales.
Attention Mechanisms in CNNs. Attention mechanisms are powerful tools that enhance feature representation in CNNs, emphasizing important information while suppressing irrelevant noise. Two modules used in our work are:
• Squeeze-and-Excitation (SE) blocks, proposed by Hu et al. [13], introduce channel-wise attention by modeling the interdependencies between feature channels. This allows the network to recalibrate the importance of different channels, leading to improved discriminative ability, especially in cluttered scenes.
• Convolutional Block Attention Module (CBAM), proposed by Woo et al. [14], extends this idea by incorporating channel and spatial attention. CBAM sequentially applies channel attention followed by spatial attention to refine the feature maps, making it particularly effective for tasks involving small and occluded objects.

Several studies have demonstrated that integrating SE or CBAM modules into standard CNNs improves performance across tasks such as remote sensing [15] [16], image classification, object detection [10] [11] [12], and segmentation [17]. However, their application to sports analytics, particularly for small object detection in dynamic environments, has been limited. In this paper, we explore the benefits of applying SE and CBAM to enhance the ball detection capability of an FPN-based network.
Methodology

In this section, we first describe the baseline architecture (FootAndBall) [3] that serves as the foundation for our work. Then, we present the proposed modifications, which involve integrating attention mechanisms — Squeeze-and-Excitation (SE) [13] and the Convolutional Block Attention Module (CBAM) [14] — to improve the detection of small, challenging objects such as the ball.
Integration of SE Block in Backbone. We add a Squeeze-and-Excitation (SE) [13] module after the first, third, and fifth convolutional blocks (Conv1, Conv3, and Conv5) in the backbone. The SE block works by performing global average pooling across each channel of the feature map, creating a channel descriptor that passes through two fully connected layers with ReLU and sigmoid activations to learn the importance of each channel. The output is used to reweight the input feature map channels:

$F_{\text{scaled}} = F \cdot \sigma\big(W_1 \cdot \mathrm{ReLU}(W_2 \cdot \mathrm{GAP}(F))\big)$, (1)

where $F$ is the input feature map, $\mathrm{GAP}$ is global average pooling, and $W_1$, $W_2$ are the learned weights. This allows the network to focus on informative feature channels and improve the representation of small objects [13] [20]. Fig. 1 represents a diagram of the SE block.
Fig. 1. Squeeze-and-excitation block
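For concreteness, the following is a minimal PyTorch sketch of the SE block described by Eq. (1). The reduction ratio of 16 inside the bottleneck is the default from Hu et al. [13] and is an assumption, since this paper does not report the value used.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise recalibration as in Eq. (1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)       # GAP(F): one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),          # W2
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),          # W1
            nn.Sigmoid(),                         # sigma
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))      # channel importance weights in (0, 1)
        return x * w.view(b, c, 1, 1)             # reweight the input feature map channels


# Example: recalibrate the 16-channel output of the first backbone block.
features = torch.randn(1, 16, 360, 640)
recalibrated = SEBlock(16)(features)              # same shape, channel-wise rescaled
```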
CBAM in Ball Classifier Head. The Convolutional Block Attention Module (CBAM) [14] enhances both channel and spatial attention. We apply CBAM to the output feature map before the ball classification head. CBAM sequentially applies:
1. Channel attention, using average and max pooling along the spatial dimensions followed by shared MLP layers.
2. Spatial attention, using a convolution over concatenated average-pooled and max-pooled feature maps across channels.

This results in a refined feature map:

$\mathrm{CBAM}(F) = \mathrm{SA}\big(\mathrm{CA}(F)\big) \cdot F$, (2)

$\mathrm{CA}(F) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big)$, (3)

$\mathrm{SA}(F) = \sigma\big(f^{7\times 7}([F^{s}_{avg}, F^{s}_{max}])\big)$, (4)

where $\mathrm{CA}$ and $\mathrm{SA}$ are the channel and spatial attention maps, $F^{c}_{avg}$ and $F^{c}_{max}$ are the spatially average- and max-pooled channel descriptors, $F^{s}_{avg}$ and $F^{s}_{max}$ are the channel-wise average- and max-pooled spatial maps, and $f^{7\times 7}$ is a 7×7 convolution. The schematic representation of the CBAM architecture is illustrated in Fig. 2.
Fig. 2. Convolutional Block Attention Module (CBAM). The top diagram provides a general overview of the CBAM architecture. The middle diagram details the Channel Attention Module. The bottom diagram illustrates the Spatial Attention Module
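A minimal PyTorch sketch of CBAM from Eqs. (2)–(4) is given below. It follows the standard sequential formulation of Woo et al. [14] (channel attention applied and multiplied first, then spatial attention); the reduction ratio and the 7×7 spatial kernel are the defaults from that paper, not values reported in this work.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """CA(F), Eq. (3): shared MLP over average- and max-pooled channel descriptors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),   # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),   # W1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)                    # shape (B, C, 1, 1)


class SpatialAttention(nn.Module):
    """SA(F), Eq. (4): 7x7 convolution over channel-wise average- and max-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # shape (B, 1, H, W)


class CBAM(nn.Module):
    """Eq. (2): channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.ca(x)        # refine which channels matter
        return x * self.sa(x)     # refine where in the frame to look


# Example: refine the 32-channel feature map entering the ball classification head.
refined = CBAM(32)(torch.randn(1, 32, 270, 480))
```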
Modified Network Architecture Overview. The overall architecture remains fully convolutional and lightweight, but with improved attention modeling. The SE-enhanced backbone generates richer feature maps, while the CBAM-augmented detection head improves ball localization precision. Fig. 3 illustrates the modified network architecture, where the SE modules are integrated into the first, third, and fifth convolutional blocks, and the CBAM module is integrated into the ball classification layer. A schematic comparison between the original and modified models is provided in Table 1, and a code-level sketch of the same wiring follows the table.

Fig. 3. The modified network architecture includes SE layers in blocks Conv1, Conv3, and Conv5, and a CBAM layer in the ball classifier head
Table 1. Comparison of original and modified network architectures

| Block | FootAndBall layers | Modified Network Architecture layers | Output size |
|---|---|---|---|
| Conv1 | 16 filters 3x3; MaxPool 2x2 | 16 filters 3x3; SE block; MaxPool 2x2 | w/2, h/2, 16 |
| Conv2 | 32 filters 3x3; 32 filters 3x3; MaxPool 2x2 | 32 filters 3x3; 32 filters 3x3; MaxPool 2x2 | w/4, h/4, 32 |
| Conv3 | 32 filters 3x3; 32 filters 3x3; MaxPool 2x2 | 32 filters 3x3; 32 filters 3x3; SE block; MaxPool 2x2 | w/8, h/8, 32 |
| Conv4 | 64 filters 3x3; 64 filters 3x3; MaxPool 2x2 | 64 filters 3x3; 64 filters 3x3; MaxPool 2x2 | w/16, h/16, 64 |
| Conv5 | 64 filters 3x3; 64 filters 3x3; MaxPool 2x2 | 64 filters 3x3; 64 filters 3x3; SE block; MaxPool 2x2 | w/32, h/32, 64 |
| 1x1Conv1 | 32 filters 1x1 | 32 filters 1x1 | w/4, h/4, 32 |
| 1x1Conv2 | 32 filters 1x1 | 32 filters 1x1 | w/8, h/8, 32 |
| 1x1Conv3 | 32 filters 1x1 | 32 filters 1x1 | w/16, h/16, 32 |
| 1x1Conv4 | 32 filters 1x1 | 32 filters 1x1 | w/32, h/32, 32 |
| Ball classifier | 32 filters 3x3; 2 filters 3x3; Sigmoid | 32 filters 3x3; CBAM; 2 filters 3x3; Sigmoid | w/4, h/4, 1 |
| Player classifier | 32 filters 3x3; 2 filters 3x3; Sigmoid | 32 filters 3x3; 2 filters 3x3; Sigmoid | w/16, h/16, 1 |
| BBox regressor | 32 filters 3x3; 4 filters 3x3 | 32 filters 3x3; 4 filters 3x3 | w/16, h/16, 4 |
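The sketch below shows how the added modules slot into two rows of Table 1 — the SE block in Conv1 and CBAM in the ball classifier — assuming the SEBlock and CBAM classes sketched earlier in this section. Layer widths follow the table, while activation placement and padding are assumptions rather than details taken from the released FootAndBall code.

```python
import torch.nn as nn

# Conv1 row of Table 1: 16 filters 3x3 -> SE block -> MaxPool 2x2  (output w/2, h/2, 16)
conv1 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    SEBlock(16),                 # channel recalibration before downsampling
    nn.MaxPool2d(2),
)

# Ball classifier row of Table 1: 32 filters 3x3 -> CBAM -> 2 filters 3x3 -> Sigmoid
ball_classifier = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    CBAM(32),                    # channel + spatial attention on the fused w/4 feature map
    nn.Conv2d(32, 2, kernel_size=3, padding=1),
    nn.Sigmoid(),                # ball confidence map at w/4 x h/4 resolution
)
```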
Loss Function. We adopt the same loss function as in the original FootAndBall model, consisting of:
• Binary cross-entropy losses for ball and player classification.
• Smooth L1 loss for bounding box regression, as used in SSD [9] [21].

Let $L_b$, $L_p$, $L_{bbox}$ represent the ball classification loss, player classification loss, and player bounding box loss, respectively. The total loss is computed as:

$L = \frac{1}{N}\big(\alpha L_b + \beta L_p + L_{bbox}\big)$, (5)

where $\alpha$ and $\beta$ are weighting coefficients, and $N$ is the number of examples in a batch.
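A minimal sketch of Eq. (5) in PyTorch is shown below. The weighting coefficients α and β and the exact reduction over spatial locations are assumptions (placeholder values), since the paper adopts them from the original FootAndBall model without restating them.

```python
import torch
import torch.nn.functional as F


def total_loss(ball_pred, ball_gt, player_pred, player_gt,
               bbox_pred, bbox_gt, pos_mask, alpha=1.0, beta=1.0):
    """Eq. (5): L = (alpha * L_b + beta * L_p + L_bbox) / N."""
    n = ball_pred.shape[0]                                                  # batch size N
    l_b = F.binary_cross_entropy(ball_pred, ball_gt, reduction="sum")      # ball confidence map
    l_p = F.binary_cross_entropy(player_pred, player_gt, reduction="sum")  # player confidence map
    # Smooth L1 on boxes, evaluated only at locations that contain a player.
    l_bbox = F.smooth_l1_loss(bbox_pred[pos_mask], bbox_gt[pos_mask], reduction="sum")
    return (alpha * l_b + beta * l_p + l_bbox) / n
```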
Experiments

In this section, we describe the experimental configuration used to evaluate the effectiveness of the proposed modifications to the FootAndBall architecture. We assess the performance of our proposed architecture, which integrates SE and CBAM modules, and compare it with the original model.

Datasets. We used the same two datasets that were used in the baseline study:
• ISSIA-CNR Soccer Dataset [5]: contains 20,000 annotated frames from professional matches recorded using six synchronized Full HD cameras. Each frame is labeled with ball positions and player bounding boxes.
• Soccer Player Detection Dataset [22]: composed of 2,019 images captured from two professional football matches, annotated with over 22,000 player locations. Ball positions are not annotated in this dataset.

As in the original paper, we split each dataset into 80% for training and 20% for evaluation [3]. Both datasets contain a range of challenges such as motion blur, occlusions, and background clutter.
Implementation details. We implemented the model in PyTorch and trained it using the Adam optimizer [23] with a 4-step learning rate scheduler. The initial learning rate was set to 0.001 and decreased by a factor of 10 at the 10th, 25th, 50th, and 75th epochs. This gradual decay enabled the model to converge quickly in the early training phases and allowed fine-grained adjustment in later stages. Training was performed on an NVIDIA RTX 4000 Ada Generation GPU. The training hyperparameters are summarized in Table 2.
Table 2. Training hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Initial learning rate | 0.001 |
| Learning rate decay | ×0.1 at epochs 10, 25, 50, and 75 |
| Epochs | 100 |
| Batch size | 16 |
To enhance generalization, we applied data augmentation techniques, including random cropping and flipping [24].
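The 4-step schedule maps directly onto PyTorch's MultiStepLR. The snippet below is a runnable sketch of just the optimizer and decay schedule described above; the single convolutional layer stands in for the full detector.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)                  # stand-in for the attention-augmented detector
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 25, 50, 75], gamma=0.1)

for epoch in range(100):                     # 100 epochs, batch size 16, random crops/flips
    # ... iterate over augmented training batches, compute the loss, step the optimizer ...
    scheduler.step()
    if epoch + 1 in (10, 25, 50, 75):
        print(f"epoch {epoch + 1}: lr = {optimizer.param_groups[0]['lr']:.0e}")
```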
Evaluation metrics. We use the Average Precision (AP) metric, a standard object detection metric described in the Pascal VOC challenge [25]. Ball detection AP is computed based on maxima in the confidence map matching the ground truth position. Player detection AP is calculated based on predicted bounding boxes with an Intersection over Union (IoU) threshold of 0.5. We also report model size (number of trainable parameters) to evaluate efficiency.
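The two matching rules can be made concrete as follows. The pixel tolerance for the ball (a few pixels around the annotated centre) is an assumption, since the text only states that the confidence-map maximum must match the ground-truth position; the 0.5 IoU threshold for players is taken directly from the text.

```python
import numpy as np


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; a player detection counts as a true
    positive when its IoU with a ground-truth box is at least 0.5."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)


def ball_hit(conf_map, gt_xy, stride=4, tol_px=5):
    """Take the maximum of the ball confidence map (produced at w/4 x h/4) and
    check whether it lands near the annotated ball position (tolerance assumed)."""
    iy, ix = np.unravel_index(np.argmax(conf_map), conf_map.shape)
    pred_xy = np.array([ix * stride, iy * stride])   # map grid cell back to image pixels
    return float(np.hypot(*(pred_xy - np.asarray(gt_xy)))) <= tol_px
```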
Results. Table 3 compares the original model with the proposed enhanced version. We compare the Average Precision for ball and player detection on the ISSIA-CNR dataset and player detection on the Soccer Player Detection dataset.
Table 3. Evaluation results of the original model in comparison with the enhanced model

| Model | Ball AP | Player AP (ISSIA) | Mean AP | Player AP (SPD) | Params |
|---|---|---|---|---|---|
| FootAndBall | 0.909 | 0.921 | 0.915 | 0.885 | 199K |
| SE + CBAM | 0.927 | 0.917 | 0.922 | 0.871 | 200K |
Our final model with SE and CBAM blocks shows the highest ball detection accuracy, outperforming the baseline by 2% AP gain. Player detection performance is also maintained at the same level. Despite the added attention layers, the model remains lightweight, with a slightly increased number of parameters. Fig. 4 illustrates a comparative analysis of classification outcomes between the original and proposed models, highlighting instances where the original results are inadequate. In contrast, the proposed model successfully classifies the ball, demonstrating its enhanced efficacy.

Fig. 4. Comparison of ball classification results: the top row displays failed classifications from the original model, while the bottom row illustrates successful classifications from the proposed model
Discussion

The experimental results show that the integration of Squeeze-and-Excitation (SE) [13] and Convolutional Block Attention Module (CBAM) [14] blocks into the FootAndBall architecture improves the performance of the model on the task of ball detection. This is a significant advancement, as accurate ball detection remains one of the most complicated tasks in football video analysis owing to the ball's small size, frequent occlusions, motion blur, and visual similarity to player gear and background elements [3] [4].

Adding SE blocks in the backbone enhances the model's ability to emphasize informative feature channels while suppressing less relevant ones. This aligns with prior findings that SE improves model sensitivity to subtle visual cues in cluttered scenes [13] [20]. In our case, the SE-enhanced backbone produces stronger features for ball detection. Similar channel-wise recalibration strategies have also proven effective in other small object detection contexts, such as traffic sign detection [11].

Including CBAM in the ball detection head applies both channel and spatial attention. It allows the model to focus on small spatial regions with high semantic relevance, such as regions that contain fast-moving objects. This spatial attention appears to help distinguish the ball from distractors such as white socks, pitch lines, or advertisements, which frequently cause false positives in the baseline model.
Combining SE and CBAM yields the highest accuracy, confirming their complementary nature. SE enhances global channel interactions during feature extraction, while CBAM introduces localized attention refinements before detection [14]. Similar hybrid attention strategies have succeeded in medical image analysis [17] and aerial image object detection [15] [16], where high-level semantics and spatial precision are critical.

Despite the additional attention layers, the proposed model remains compact and capable of real-time performance. This echoes trends in lightweight attention integration found in mobile-focused detection models like MobileNetV3 [26]. Our enhancements increased detection accuracy without a significant trade-off in model size.

However, some challenges remain. The model occasionally fails in edge cases involving heavy occlusion or extreme motion blur, conditions common in real-world sports footage. Fig. 5 illustrates challenging frames where the model either could not detect the ball or incorrectly identified it in its absence. Because our system processes frames independently, it cannot exploit temporal continuity to reinforce uncertain predictions. Techniques such as temporal feature aggregation or recurrent modules have been shown to improve consistency in video-based detection tasks [27] [28] and could be beneficial here.
Fig. 5. Examples of model misidentifications, showing a false positive detection (a) and missed detections (b) of the ball
Overall, the results support our hypothesis that attention mechanisms considerably enhance the detection of small, context-sensitive objects in sports videos. The proposed approach balances accuracy and computational efficiency, making it suitable for real-time sports analytics systems.
Conclusion

This paper presents an enhanced deep learning architecture for joint player and ball detection in football match videos. Building on the original FootAndBall model, we introduce two attention mechanisms — Squeeze-and-Excitation (SE) [13] and the Convolutional Block Attention Module (CBAM) [14] — to enhance the accuracy of ball detection, a task known to be difficult due to the ball's small size, high motion, and frequent occlusion [4].

By integrating SE blocks into the feature extraction backbone, we enabled the network to adaptively recalibrate channel-wise feature responses, enhancing its discriminative power in complex scenes [13] [20]. Additionally, incorporating CBAM into the ball detection head improved the network's ability to focus on relevant spatial regions, significantly increasing its precision in identifying the ball amidst cluttered backgrounds. We also proposed a 4-step learning rate schedule, which helped improve training stability and convergence over time.

Our experiments on the ISSIA-CNR [5] and Soccer Player Detection [22] datasets demonstrated that the proposed attention-based enhancements lead to notable improvements in detection accuracy, particularly for the ball, increasing its AP by 2%, while maintaining real-time inference speed and model efficiency. These results validate the effectiveness of lightweight attention modules in sports video analysis systems.

While the proposed model achieved strong results, several opportunities for further improvement exist, such as temporal modeling. Our current approach operates on single frames, without leveraging temporal consistency. Incorporating temporal information through optical flow, frame-level feature aggregation, or recurrent networks (e.g., ConvLSTM or 3D CNNs) could enhance robustness, especially in motion blur or occlusion scenarios [27] [28].

Overall, our results highlight that attention mechanisms are a promising avenue for improving small-object detection in sports analytics. The proposed system offers a solid foundation for future research and real-world applications in football match analysis by combining architectural innovation with efficiency considerations.
References

1. A. Bialkowski, P. Lucey, P. Carr, Y. Yue, S. Sridharan and I. Matthews, "Large-Scale Analysis of Soccer Matches Using Spatiotemporal Tracking Data," in 2014 IEEE International Conference on Data Mining, December 2014. doi: 10.1109/ICDM.2014.133
2. M. Manafifard, H. Ebadi and H. Moghaddam, "A Survey on Player Tracking in Soccer Videos," Computer Vision and Image Understanding, vol. 159, pp. 19-46, June 2017. doi: 10.1016/j.cviu.2017.02.002
3. J. Komorowski, G. Kurzejamski and G. Sarwas, "FootAndBall: Integrated Player and Ball Detector," in 15th International Conference on Computer Vision Theory and Applications, pp. 47-56, Valletta, Malta, January 2020. doi: 10.5220/0008916000470056
4. P. Kamble, A. Keskar and K. Bhurchandi, "A deep learning ball tracking system in soccer videos," Opto-Electronics Review, vol. 27, no. 1, pp. 58-69, March 2019. doi: 10.1016/j.opelre.2019.02.003
5. T. D'Orazio, M. Leo, N. Mosca, P. Spagnolo and P. L. Mazzeo, "A Semi-automatic System for Ground Truth Generation of Soccer Video Sequences," in Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, Genova, Italy, September 2009. doi: 10.1109/AVSS.2009.69
6. T. Wang and T. Li, "Deep Learning-Based Football Player Detection in Videos," Computational Intelligence and Neuroscience, pp. 1-8, 2022. doi: 10.1155/2022/3540642
7. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie, "Feature Pyramid Networks for Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936-944, Honolulu, HI, USA, 2017. doi: 10.1109/CVPR.2017.106
8. J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, 2018. doi: 10.48550/arXiv.1804.02767
9. W. Liu, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. Berg, "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision, pp. 21-37, 2016. doi: 10.1007/978-3-319-46448-0_2
10. Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li and S. Hu, "Traffic-Sign Detection and Classification in the Wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016. doi: 10.1109/CVPR.2016.232
11. Y. Chen, J. Wang, Z. Dong, Y. Yang, Q. Luo and M. Gao, "An Attention Based YOLOv5 Network for Small Traffic Sign Recognition," in IEEE 31st International Symposium on Industrial Electronics (ISIE), Anchorage, AK, USA, June 2022. doi: 10.1109/ISIE51582.2022.9831717
12. S. Du, W. Pan, N. Li, S. Dai, B. Xu, H. Liu, C. Xu and X. Li, "TSD-YOLO: Small traffic sign detection based on improved YOLO v8," IET Image Processing, vol. 18, June 2024. doi: 10.1049/ipr2.13141
13. J. Qu, Z. Tang, L. Zhang, Y. Zhang and Z. Zhang, "Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion," Remote Sensing, vol. 15, p. 2728, May 2023. doi: 10.3390/rs15112728
14. J. Rabbi, N. Ray, M. Schubert, S. Chowdhury and D. Chao, "Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network," Remote Sensing, vol. 12, p. 1432, April 2020. doi: 10.3390/rs12091432
15. O. Oktay, J. Schlemper, L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Hammerla, B. Kainz, B. Glocker and D. Rueckert, "Attention U-Net: Learning Where to Look for the Pancreas," arXiv:1804.03999, April 2018. doi: 10.48550/arXiv.1804.03999
16. K. Min, G.-H. Lee and S.-W. Lee, "Attentional feature pyramid network for small object detection," Neural Networks, vol. 155, pp. 439-450, November 2022. doi: 10.1016/j.neunet.2022.08.029
17. V. Renò, N. Mosca, R. Marani, M. Nitti and E. Stella, "Convolutional Neural Networks Based Ball Detection in Tennis Games," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, June 2018. doi: 10.1109/CVPRW.2018.00228
18. J. Hu, L. Shen and G. Sun, "Squeeze-and-Excitation Networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 2018. doi: 10.1109/CVPR.2018.00745
19. S. Woo, J. Park, J.-Y. Lee and I. Kweon, "CBAM: Convolutional Block Attention Module," in European Conference on Computer Vision (ECCV), Munich, Germany, 2018.
20. H. Li, P. Xiong, J. An and L. Wang, "Pyramid Attention Network for Semantic Segmentation," arXiv:1805.10180, pp. 3-19, September 2018. doi: 10.1007/978-3-030-01234-2_1
21. R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015. doi: 10.1109/ICCV.2015.169
22. K. Lu, J. Chen, J. Little and H. He, "Light Cascaded Convolutional Neural Networks for Accurate Player Detection," arXiv:1709.10230, September 2017. doi: 10.48550/arXiv.1709.10230
23. D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in International Conference on Learning Representations, December 2014. doi: 10.48550/arXiv.1412.6980
24. C. Shorten and T. Khoshgoftaar, "A survey on Image Data Augmentation for Deep Learning," Journal of Big Data, vol. 6, no. 60, July 2019. doi: 10.1186/s40537-019-0197-0
25. M. Everingham, L. Van Gool, C. Williams, J. Winn and A. Zisserman, "The Pascal Visual Object Classes (VOC) challenge," International Journal of Computer Vision, vol. 88, pp. 303-338, 2010. doi: 10.1007/s11263-009-0275-4
26. A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. Le and H. Adam, "Searching for MobileNetV3," in IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1314-1324, Seoul, Korea (South), 2019. doi: 10.1109/ICCV.2019.00140
27. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang and L. Van Gool, "Temporal Segment Networks for Action Recognition in Videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2740-2755, November 2019. doi: 10.1109/TPAMI.2018.2868668
28. A. Kompella and R. Kulkarni, "A semi-supervised recurrent neural network for video salient object detection," Neural Computing and Applications, vol. 33, no. 6, pp. 2065-2083, March 2021. doi: 10.1007/s00521-020-05081-5
Received: 20.05.2025
Internal review received: 30.05.2025
External review received: 30.05.2025

About the authors:

1,2 Iryna Bohdanivna Ivasenko,
Doctor of Technical Sciences, Senior Researcher, Professor
e-mail: iryna.b.ivasenko@lpnu.ua
https://orcid.org/0000-0003-3795-9779

2 Serhii Serhiiovych Bishyr,
first-year PhD student, Lviv Polytechnic National University
e-mail: serhii.s.bishyr@lpnu.ua
https://orcid.org/0009-0009-1008-9292

Authors' affiliations:

1 G.V. Karpenko Physico-Mechanical Institute of the NAS of Ukraine
Tel.: +3(032) 263-30-88
5 Naukova St., Lviv, 79060
e-mail: pminasu@ipm.lviv.ua

2 Lviv Polytechnic National University
Tel.: +3(8032) 258-22-82
12 Stepana Bandery St., Lviv, 79013
e-mail: coffice@lpnu.ua, com.centre@lpnu.ua