Application of small language models for semantic analysis of Web interface accessibility
Web accessibility remains a critical aspect of ensuring equal opportunities for internet resource usage, especially for people with disabilities. The Web Content Accessibility Guidelines 2.5.3 criterion “Label in Name” requires that the accessible name of an interface component include text that is visually presented.
Saved in:
| Date: | 2025 |
|---|---|
| Main Authors: | Kuzikov, B.O.; Shovkoplias, O.A.; Tytov, P.O.; Shovkoplias, S.R. |
| Format: | Article |
| Language: | English |
| Published: | PROBLEMS IN PROGRAMMING, 2025 |
| Subjects: | Small Language Models; Semantic Analysis; Text Classification; Web Accessibility; Web Content Accessibility Guidelines; Model Fine-Tuning; Natural Language Processing |
| Online Access: | https://pp.isofts.kiev.ua/index.php/ojs1/article/view/839 |
| Journal Title: | Problems in programming |
Institution
Problems in programming

| id | pp_isofts_kiev_ua-article-839 |
|---|---|
| record_format | ojs |
| resource_txt_mv | ppisoftskievua/ea/d7b90a22515e619a336265aacea861ea.pdf |
| spelling | pp_isofts_kiev_ua-article-839 2025-11-03T11:33:19Z Application of small language models for semantic analysis of Web interface accessibility Застосування малих мовних моделей для семантичного аналізу доступності веб-інтерфейсів Kuzikov, B.O. Shovkoplias, O.A. Tytov, P.O. Shovkoplias, S.R. Small Language Models; Semantic Analysis; Text Classification; Web Accessibility; Web Content Accessibility Guidelines; Model Fine-Tuning; Natural Language Processing UDC 004.89:004.51 малі мовні моделі; семантичний аналіз; класифікація тексту; веб-доступність; настанови з доступності вебвмісту; тонке налаштування моделей; обробка природної мови УДК 004.89:004.51 Web accessibility remains a critical aspect of ensuring equal opportunities for internet resource usage, especially for people with disabilities. The Web Content Accessibility Guidelines 2.5.3 criterion “Label in Name” requires that the accessible name of an interface component include text that is visually presented. Existing automated verification methods for this criterion are predominantly based on primitive string comparison, which does not account for semantic context. Objective: investigate the possibilities of using small language models with up to 1 billion parameters for automated semantic analysis of compliance with Web Content Accessibility Guidelines 2.5.3, as an alternative to resource-intensive large language models and limited algorithmic methods. Methodology: the research involved creating synthetic datasets (7,200 English-language and 5,615 Ukrainian-language samples) and using real-world datasets (Top500 – 380 samples, UaUniv – 319). Sentence Bidirectional Encoder Representations from Transformers models were tested for computing semantic similarity, and fine-tuning of the google/electra-base-discriminator model was performed for 3-class classification of semantic relationships (“similar”, “unrelated”, “opposite”). Results: the trained model of 437 MB demonstrated high accuracy on synthetic data (0.96) and sufficient accuracy on real datasets (Top500: 0.77, UaUniv: 0.73). The model effectively identifies all three classes of semantic relationships with an accuracy of 95.1 % for “opposite”, 92.7 % for “unrelated”, and 97.4 % for “similar” texts in the validation sample. Conclusions: the research confirmed the feasibility of using small language models for automated verification of semantic compliance according to Web Content Accessibility Guidelines 2.5.3. The proposed approach provides acceptable classification accuracy with significantly lower computational costs compared to large language models, allowing for the integration of semantic analysis into standard development and testing processes. Despite certain limitations, the developed solution can significantly improve web accessibility testing. Problems in programming 2025; 2: 77-86 Вебдоступність залишається важливим аспектом для забезпечення рівних можливостей користування інтернет-ресурсами, особливо для людей з інвалідністю. Критерій Настанов з доступності вебвмісту 2.5.3 «Мітка в імені» вимагає, щоб доступне ім’я компонента інтерфейсу включало текст, представлений візуально. Існуючі методи автоматизованої перевірки цього критерію базуються переважно на примітивному порівнянні рядків, не враховуючи семантичний контекст. Мета. Дослідити можливості застосування малих мовних моделей із кількістю параметрів до 1 мільярда для автоматизованого семантичного аналізу відповідності критерію Настанов із доступністю вебвмісту 2.5.3 як альтернативи ресурсомістким великим мовним моделям і обмеженим алгоритмічним методам. Методологія. Дослідження передбачало створення синтетичних (англомовних – 7200, україномовних – 5615 прикладів) та використання реальних наборів даних (380 прикладів із 500 найвідвідуваніших вебсайтів, 319 прикладів із вебсайтів українських університетів). Здійснено тестування моделей двоспрямованих кодувальних представлень речень із трансформерів для обчислення семантичної схожості та використано тонке налаштування базової дискримінаторної моделі ELECTRA від Google для 3-класової класифікації семантичних відношень («схожі», «непов’язані», «протилежні»). Результати. Навчена модель розміром 437 МБ продемонструвала високу точність на синтетичних даних (0,96) та достатню точність на реальних наборах даних (0,77 для найвідвідуваніших вебсайтів та 0,73 для вебсайтів українських університетів). Модель здатна ефективно ідентифікувати всі три класи семантичних відношень із точністю 95,1 % для «протилежних», 92,7 % для «непов’язаних» та 97,4 % для «схожих» текстів на валідаційній вибірці. Висновки. Дослідження підтвердило доцільність застосування малих мовних моделей для автоматизованої перевірки семантичної відповідності згідно з критерієм Настанов із доступністю вебвмісту 2.5.3. Запропонований підхід забезпечує прийнятну точність класифікації при значно менших обчислювальних витратах порівняно з великими мовними моделями, що дозволяє інтегрувати семантичний аналіз у стандартні процеси розроблення та тестування. Незважаючи на певні обмеження, розроблене рішення може істотно покращити процес тестування вебдоступності. Problems in programming 2025; 2: 77-86 PROBLEMS IN PROGRAMMING ПРОБЛЕМЫ ПРОГРАММИРОВАНИЯ ПРОБЛЕМИ ПРОГРАМУВАННЯ 2025-09-07 Article Article application/pdf https://pp.isofts.kiev.ua/index.php/ojs1/article/view/839 10.15407/pp2025.02.077 PROBLEMS IN PROGRAMMING; No 2 (2025); 77-86 ПРОБЛЕМЫ ПРОГРАММИРОВАНИЯ; No 2 (2025); 77-86 ПРОБЛЕМИ ПРОГРАМУВАННЯ; No 2 (2025); 77-86 1727-4907 10.15407/pp2025.02 en https://pp.isofts.kiev.ua/index.php/ojs1/article/view/839/890 Copyright (c) 2025 PROBLEMS IN PROGRAMMING |
| institution | Problems in programming |
| baseUrl_str | https://pp.isofts.kiev.ua/index.php/ojs1/oai |
| datestamp_date | 2025-11-03T11:33:19Z |
| collection | OJS |
| language | English |
| topic | Small Language Models Semantic Analysis Text Classification Web Accessibility Web Content Accessibility Guidelines Model Fine-Tuning Natural Language Processing UDC 004.89:004.51 |
| spellingShingle | Small Language Models Semantic Analysis Text Classification Web Accessibility Web Content Accessibility Guidelines Model Fine-Tuning Natural Language Processing UDC 004.89:004.51 Kuzikov, B.O. Shovkoplias, O.A. Tytov, P.O. Shovkoplias, S.R. Application of small language models for semantic analysis of Web interface accessibility |
| topic_facet | Small Language Models Semantic Analysis Text Classification Web Accessibility Web Content Accessibility Guidelines Model Fine-Tuning Natural Language Processing UDC 004.89:004.51 малі мовні моделі семантичний аналіз класифікація тексту веб-доступність настанови з доступності вебвмісту тонке налаштування моделей обробка природної мови УДК 004.89:004.51 |
| format | Article |
| author | Kuzikov, B.O. Shovkoplias, O.A. Tytov, P.O. Shovkoplias, S.R. |
| author_facet | Kuzikov, B.O. Shovkoplias, O.A. Tytov, P.O. Shovkoplias, S.R. |
| author_sort | Kuzikov, B.O. |
| title | Application of small language models for semantic analysis of Web interface accessibility |
| title_short | Application of small language models for semantic analysis of Web interface accessibility |
| title_full | Application of small language models for semantic analysis of Web interface accessibility |
| title_fullStr | Application of small language models for semantic analysis of Web interface accessibility |
| title_full_unstemmed | Application of small language models for semantic analysis of Web interface accessibility |
| title_sort | application of small language models for semantic analysis of web interface accessibility |
| title_alt | Застосування малих мовних моделей для семантичного аналізу доступності веб-інтерфейсів |
| description | Web accessibility remains a critical aspect of ensuring equal opportunities for internet resource usage, especially for people with disabilities. The Web Content Accessibility Guidelines 2.5.3 criterion “Label in Name” requires that the accessible name of an interface component include text that is visually presented. Existing automated verification methods for this criterion are predominantly based on primitive string comparison, which does not account for semantic context. Objective: investigate the possibilities of using small language models with up to 1 billion parameters for automated semantic analysis of compliance with Web Content Accessibility Guidelines 2.5.3, as an alternative to resource-intensive large language models and limited algorithmic methods. Methodology: the research involved creating synthetic datasets (7,200 English-language and 5,615 Ukrainian-language samples) and using real-world datasets (Top500 – 380 samples, UaUniv – 319). Sentence Bidirectional Encoder Representations from Transformers models were tested for computing semantic similarity, and fine-tuning of the google/electra-base-discriminator model was performed for 3-class classification of semantic relationships (“similar”, “unrelated”, “opposite”). Results: the trained model of 437 MB demonstrated high accuracy on synthetic data (0.96) and sufficient accuracy on real datasets (Top500: 0.77, UaUniv: 0.73). The model effectively identifies all three classes of semantic relationships with an accuracy of 95.1 % for “opposite”, 92.7 % for “unrelated”, and 97.4 % for “similar” texts in the validation sample. Conclusions: the research confirmed the feasibility of using small language models for automated verification of semantic compliance according to Web Content Accessibility Guidelines 2.5.3. The proposed approach provides acceptable classification accuracy with significantly lower computational costs compared to large language models, allowing for the integration of semantic analysis into standard development and testing processes. Despite certain limitations, the developed solution can significantly improve web accessibility testing. Problems in programming 2025; 2: 77-86 |
| publisher | PROBLEMS IN PROGRAMMING |
| publishDate | 2025 |
| url | https://pp.isofts.kiev.ua/index.php/ojs1/article/view/839 |
| work_keys_str_mv | AT kuzikovbo applicationofsmalllanguagemodelsforsemanticanalysisofwebinterfaceaccessibility AT shovkopliasoa applicationofsmalllanguagemodelsforsemanticanalysisofwebinterfaceaccessibility AT tytovpo applicationofsmalllanguagemodelsforsemanticanalysisofwebinterfaceaccessibility AT shovkopliassr applicationofsmalllanguagemodelsforsemanticanalysisofwebinterfaceaccessibility AT kuzikovbo zastosuvannâmalihmovnihmodelejdlâsemantičnogoanalízudostupnostívebínterfejsív AT shovkopliasoa zastosuvannâmalihmovnihmodelejdlâsemantičnogoanalízudostupnostívebínterfejsív AT tytovpo zastosuvannâmalihmovnihmodelejdlâsemantičnogoanalízudostupnostívebínterfejsív AT shovkopliassr zastosuvannâmalihmovnihmodelejdlâsemantičnogoanalízudostupnostívebínterfejsív |
| first_indexed | 2025-09-17T09:25:11Z |
| last_indexed | 2025-11-04T02:10:25Z |
| _version_ | 1850412032952631296 |
| fulltext |
Semantic Web and Linguistic Systems

© B. O. Kuzikov, O. A. Shovkoplias, P. O. Tytov, S. R. Shovkoplias, 2025
ISSN 1727-4907. Problems in Programming. 2025. No. 2
UDC 004.89:004.51 https://doi.org/10.15407/pp2025.02.077
B. O. Kuzikov, O. A. Shovkoplias, P. O. Tytov, S. R. Shovkoplias
APPLICATION OF SMALL LANGUAGE MODELS
FOR SEMANTIC ANALYSIS OF WEB INTERFACE
ACCESSIBILITY
Web accessibility remains a critical aspect of ensuring equal opportunities for internet resource usage, especially for people with disabilities. The Web Content Accessibility Guidelines 2.5.3 criterion “Label in Name” requires that the accessible name of an interface component include text that is visually presented. Existing automated verification methods for this criterion are predominantly based on primitive string comparison, which does not account for semantic context. Objective: investigate the possibilities of using small language models with up to 1 billion parameters for automated semantic analysis of compliance with Web Content Accessibility Guidelines 2.5.3, as an alternative to resource-intensive large language models and limited algorithmic methods. Methodology: the research involved creating synthetic datasets (7,200 English-language and 5,615 Ukrainian-language samples) and using real-world datasets (Top500 – 380 samples, UaUniv – 319). Sentence Bidirectional Encoder Representations from Transformers models were tested for computing semantic similarity, and fine-tuning of the google/electra-base-discriminator model was performed for 3-class classification of semantic relationships (“similar”, “unrelated”, “opposite”). Results: the trained model of 437 MB demonstrated high accuracy on synthetic data (0.96) and sufficient accuracy on real datasets (Top500: 0.77, UaUniv: 0.73). The model effectively identifies all three classes of semantic relationships with an accuracy of 95.1 % for “opposite”, 92.7 % for “unrelated”, and 97.4 % for “similar” texts in the validation sample. Conclusions: the research confirmed the feasibility of using small language models for automated verification of semantic compliance according to Web Content Accessibility Guidelines 2.5.3. The proposed approach provides acceptable classification accuracy with significantly lower computational costs compared to large language models, allowing for the integration of semantic analysis into standard development and testing processes. Despite certain limitations, the developed solution can significantly improve web accessibility testing.
Keywords: Small Language Models, Semantic Analysis, Text Classification, Web Accessibility, Web Content
Accessibility Guidelines, Model Fine-Tuning, Natural Language Processing
Б. О. Кузіков, О. А. Шовкопляс, П. О. Титов, С. Р. Шовкопляс

ЗАСТОСУВАННЯ МАЛИХ МОВНИХ МОДЕЛЕЙ ДЛЯ СЕМАНТИЧНОГО АНАЛІЗУ ДОСТУПНОСТІ ВЕБІНТЕРФЕЙСІВ

Вебдоступність залишається важливим аспектом для забезпечення рівних можливостей користування інтернет-ресурсами, особливо для людей з інвалідністю. Критерій Настанов з доступності вебвмісту 2.5.3 «Мітка в імені» вимагає, щоб доступне ім’я компонента інтерфейсу включало текст, представлений візуально. Існуючі методи автоматизованої перевірки цього критерію базуються переважно на примітивному порівнянні рядків, не враховуючи семантичний контекст. Мета. Дослідити можливості застосування малих мовних моделей із кількістю параметрів до 1 мільярда для автоматизованого семантичного аналізу відповідності критерію Настанов із доступністю вебвмісту 2.5.3 як альтернативи ресурсомістким великим мовним моделям і обмеженим алгоритмічним методам. Методологія. Дослідження передбачало створення синтетичних (англомовних – 7200, україномовних – 5615 прикладів) та використання реальних наборів даних (380 прикладів із 500 найвідвідуваніших вебсайтів, 319 прикладів із вебсайтів українських університетів). Здійснено тестування моделей двоспрямованих кодувальних представлень речень із трансформерів для обчислення семантичної схожості та використано тонке налаштування базової дискримінаторної моделі ELECTRA від Google для 3-класової класифікації семантичних відношень («схожі», «непов’язані», «протилежні»). Результати. Навчена модель розміром 437 МБ продемонструвала високу точність на синтетичних даних (0,96) та достатню точність на реальних наборах даних (0,77 для найвідвідуваніших вебсайтів та 0,73 для вебсайтів українських університетів). Модель здатна ефективно ідентифікувати всі три класи семантичних відношень із точністю 95,1 % для «протилежних», 92,7 % для «непов’язаних» та 97,4 % для «схожих» текстів на валідаційній вибірці. Висновки. Дослідження підтвердило доцільність застосування малих мовних моделей для автоматизованої перевірки семантичної відповідності згідно з критерієм Настанов із доступністю вебвмісту 2.5.3. Запропонований підхід забезпечує прийнятну точність класифікації при значно менших обчислювальних витратах порівняно з великими мовними моделями, що дозволяє інтегрувати семантичний аналіз у стандартні процеси розроблення та тестування. Незважаючи на певні обмеження, розроблене рішення може істотно покращити процес тестування вебдоступності.

Ключові слова: малі мовні моделі, семантичний аналіз, класифікація тексту, вебдоступність, Настанови з доступності вебвмісту, тонке налаштування моделей, обробка природної мови
Introduction
Motivation. Web accessibility is critically important for ensuring equal access to information and services for all users. The Web Content Accessibility Guidelines (WCAG) define relevant standards. One important criterion is WCAG 2.5.3 “Label in Name”, which specifies that the accessible name of an interface component must include text that is visually presented. This allows users with disabilities to rely on visible labels as a means of interaction: individuals using voice control can activate elements by speaking their visible names, and users of text-to-speech technologies gain a better experience due to consistency between seen and heard text [1]. Developers have ethical and legal obligations to comply with accessibility standards. Conducting accessibility testing is fundamental to improving application usability for people with disabilities and generally enhances usability for all users [2].
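At its simplest, the criterion can be checked with a string-containment test. A minimal sketch of such a check is given below; the function name and the normalization choices are illustrative, not taken from any specific tool:

```python
import re

def naive_label_in_name(visible_label: str, accessible_name: str) -> bool:
    """Naive WCAG 2.5.3 check: does the accessible name contain the
    visible label after case and whitespace normalization?"""
    def norm(s):
        # Collapse runs of whitespace and ignore case
        return re.sub(r"\s+", " ", s).strip().lower()
    return norm(visible_label) in norm(accessible_name)

print(naive_label_in_name("Submit", "Submit registration form"))  # passes
print(naive_label_in_name("Search", "Find products"))             # fails on synonyms
```

The second call shows the weakness discussed below: a semantically reasonable pairing fails because the check is purely lexical.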
Automated accessibility testing is recognized as an effective tool for quickly identifying a significant portion of problems. It allows for systematic evaluation of the user interface and code for compliance with numerous rules and recommendations [2]. However, despite its advantages, automated testing is not an exhaustive solution and has significant limitations [3]. In particular, automated tools cannot fully evaluate the context of use, complex interactions, and subjective aspects of user experience, which are critically important for users with different needs [4, 5]. Passing automated tests does not guarantee full application accessibility, and results may contain false positives that require human verification. Thus, a comprehensive approach to ensuring accessibility requires combining automated testing with manual testing and the involvement of users with disabilities.
The task of verifying the WCAG 2.5.3 criterion essentially comes down to fuzzy string comparison, with or without consideration of semantics and structure, depending on the quality of the tool. Existing automated tools may take into account the Best Practice recommendation that the label should begin with the visible text. Classical, deterministic string comparison algorithms can be divided into several categories: fuzzy text comparison (Levenshtein distance, longest common subsequence), phonetic algorithms (Soundex, Metaphone, New York State Identification and Intelligence System), and token-based methods (Jaccard coefficient, cosine similarity, BM25). It is important to note that, of those listed, only the last category can account for semantic proximity of texts, and even its capabilities are limited without the use of semantic representations of the compared strings.
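The failure mode of these classical metrics can be demonstrated with standard-library tools alone. The sketch below (using `difflib.SequenceMatcher` as a longest-common-subsequence-style measure and a simple token Jaccard coefficient; the example strings are invented) shows that a negated label scores as *more* similar to the original than an unrelated one:

```python
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    # Character-level matching ratio (related to longest common subsequence)
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    # Token-overlap coefficient
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# A negated (opposite) label is lexically closer to the original
# than an unrelated label, even though its meaning is inverted.
print(char_similarity("Submit", "Do not submit"))    # high
print(char_similarity("Submit", "Open settings"))    # low
print(jaccard("Submit form", "Do not submit form"))  # high overlap
```

No threshold on such scores can separate “opposite” from “similar”, which motivates the semantic approach investigated in this paper.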
Recently, there has been significant interest in artificial intelligence capabilities. Despite impressive results, tools such as ChatGPT have limitations in specific tasks, particularly in accessibility testing, as they use training data that does not always cover the specifics and requirements of modern accessibility standards [6]. In the context of artificial intelligence development, particularly in the field of natural language processing, language models are often classified by their size and computational requirements. Large language models (LLMs), such as GPT-4 or GLaM [7], contain hundreds of billions or even trillions of parameters and require significant computational resources for training and execution. Due to their ability to consider semantic relationships and context, LLMs can understand and evaluate headings and labels with a high degree of accuracy. However, deploying LLMs for real-time accessibility testing faces significant challenges, including high computational requirements, high latency, and potential instability of provider APIs. This creates a need for efficient, reliable, and context-oriented solutions that can operate with limited computational resources.
Small language models (SLMs) are significantly more compact – they typically contain from a few hundred million to a few billion parameters, ensuring their availability and efficient deployment on standard equipment and facilitating integration into everyday tools and development processes [8]. Among SLMs, micro- and nano-language models stand out, with parameter counts typically not exceeding one billion. Despite their smaller size, these models can achieve performance comparable to larger models in specific tasks after appropriate fine-tuning [8, 9]. For example, the Phi-3-mini model, with only 3.8 billion parameters, demonstrated performance corresponding to models twice its size in various natural language understanding tasks [10].
The aim of the research is to analyze the capabilities of SLMs (up to 1 billion parameters) for automated verification of compliance with the WCAG 2.5.3 criterion. We investigate the use of Sentence Transformers (SBERT) models, ready-made classification tasks from the Hugging Face platform, and fine-tuning of a selected SLM. The hypothesis is that properly configured SLMs can provide accurate semantic analysis of relationships between labels and names through 3-class classification (“similar”, “unrelated”, “opposite”), while maintaining the performance characteristics necessary for integration into existing development processes. This could fill the gap between simple string matching and resource-intensive LLM-based solutions.
The main research objectives:
– Develop and adapt a methodology for applying SLMs to automated verification of compliance with the WCAG 2.5.3 “Label in Name” criterion in web interfaces.
– Evaluate the effectiveness of SBERT models for semantic analysis of relationships between visible text labels and accessible names in the context of the WCAG 2.5.3 criterion.
– Fine-tune a selected SLM for 3-class classification of relationships between visible text labels and accessible names (“similar”, “unrelated”, “opposite”).
– Validate the developed solution by evaluating its performance on synthetic datasets, as well as on real data from the Top 500 most visited websites (Top500) and Ukrainian university websites (UaUniv).
Methodology
Dataset Preparation. The research involved both synthetic and real datasets to evaluate and train SLMs. Specifically, synthetic datasets (English – 7,200 samples, Ukrainian – 5,615 samples) were created using leading LLMs (Anthropic Claude, OpenAI ChatGPT, Google Gemini, Grok 3). A diverse range of models was used to increase input data variety and minimize potential biases. To ensure meaningful control over generated samples, a taxonomy of semantic changes was developed beforehand, describing typical text modifications in web content with Accessible Rich Internet Applications (ARIA) attributes. Its application allowed for systematizing change types and ensuring the relevance of synthetic samples.
The taxonomy classifies differences between visible text and its ARIA description, considering both the nature of changes and their potential impact on web resource accessibility and security. Categories include context expansion (“Submit” → “Submit registration form”), action object changes (“Submit payment” → “Submit order”), action type changes (“Save” → “Save and delete”), negation (“Submit” → “Do not submit”), technical modifications (“Submit form” → “SUBMIT FORM”), etc.
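The taxonomy categories and their examples from the text above can be pictured as a simple lookup table; the dictionary structure itself is only an illustration of how such a taxonomy might be encoded when generating synthetic pairs:

```python
# Illustrative encoding of the taxonomy categories named in the text;
# each entry maps a category to an example (visible text, ARIA text) pair.
TAXONOMY_EXAMPLES = {
    "context expansion":      ("Submit", "Submit registration form"),
    "action object change":   ("Submit payment", "Submit order"),
    "action type change":     ("Save", "Save and delete"),
    "negation":               ("Submit", "Do not submit"),
    "technical modification": ("Submit form", "SUBMIT FORM"),
}

for category, (visible, aria) in TAXONOMY_EXAMPLES.items():
    print(f"{category}: {visible!r} -> {aria!r}")
```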
In addition to synthetic datasets, the study also utilized real datasets that reflect practical samples of semantic discrepancies in web content:
– Top500 – contains 380 samples of differences between visible text and its representation for assistive technologies. Data was collected from pages of the 500 most popular websites according to the Moz ranking [11], ensuring representation of contemporary publicly accessible internet content.
– UaUniv – includes 319 samples collected from the main pages of official websites of Ukrainian higher education institutions. Data collection was conducted as part of an accessibility study in January 2024 [12], allowing assessment of the specifics of text information presentation in the educational segment of the Ukrainian internet space.
For training classification models based on SLMs, input data was pre-annotated using the LLM-as-Judge approach [13]. In this approach, large language models serve as expert evaluators, enabling the scaling of the annotation process without requiring extensive human resources. Queries to LLMs consisted of instructions (a system prompt) and user data – pairs of visible text and ARIA labels. The model evaluated semantic similarity between these elements on a scale from –1.0 to 1.0, where –1.0 indicates complete opposition or content contradiction, 0 means no connection, and 1.0 represents complete semantic correspondence. Intermediate values reflected partial correspondence, including cases of context expansion or changes in object or action type. To ensure annotation reliability, consistency checks were performed on ratings generated by different LLMs. The numerical ratings obtained from LLMs in the range [–1.0, 1.0] were quantized into three semantic correspondence categories:
1. “Similar” – texts have identical or very close content (LLM ratings within [0.65, 1.0]).
2. “Not related” – texts have no semantic connection (ratings near zero).
3. “Opposite” – texts have opposite or contradictory meanings (ratings within [–1.0, –0.15]).
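The quantization step can be sketched directly from the thresholds above. Note that the text does not specify how ratings falling between the bands are handled; mapping them to “not related” is an assumption of this sketch:

```python
def quantize_llm_rating(score: float) -> str:
    """Map an LLM similarity rating in [-1.0, 1.0] to one of the three
    classes using the thresholds from the text. Ratings between the
    bands are treated as 'not related' here (an assumption; the paper
    does not spell out this edge case)."""
    if score >= 0.65:
        return "similar"
    if score <= -0.15:
        return "opposite"
    return "not related"

print(quantize_llm_rating(0.9))   # similar
print(quantize_llm_rating(0.1))   # not related
print(quantize_llm_rating(-0.8))  # opposite
```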
Results obtained in previous research stages allowed us to create a high-quality annotated dataset that can be used for model training and evaluation. This is an important resource, especially considering the task complexity and the need for high-quality annotations for effective machine learning in the field of semantic text analysis.

As part of the research, more economical models were tested for their suitability for semantic text comparison tasks. The following models were tested: mistral/ministral-8b, qwen/qwen2.5-coder-7b-instruct, meta-llama/llama-3.1-8b-instruct, amazon/nova-micro-v1, liquid/lfm-3b, openai/gpt-4.1-nano, google/gemini-2.5-pro-exp-03-25, and liquid/lfm-7b. However, none of these models demonstrated sufficient effectiveness for accurate analysis of semantic relationships between visible text labels and accessible names. This indicates the need to use more powerful models or additional fine-tuning to achieve acceptable results in this domain.
Using SBERT for Semantic Similarity. One promising approach to solving the semantic text comparison problem is using SBERT [14] – a specialized Python module for working with modern vector representation models and rerankers. This framework provides access to creating, using, and training state-of-the-art models for computing text embeddings. Sentence Transformers can be used both for computing vector representations of texts using Sentence Transformer models and for calculating text similarity metrics using Cross-Encoder models. This opens a wide range of applications, including semantic search, determining semantic textual similarity, and paraphrase detection.

Over 10,000 pre-trained Sentence Transformer models are available on the Hugging Face platform for immediate application, including many modern models from the Massive Text Embeddings Benchmark (MTEB) [15] ranking. Additionally, the framework makes it easy to train or fine-tune custom embedding models or rerankers, enabling the creation of specialized models for specific use cases. Particularly promising is the application of this toolkit for Semantic Textual Similarity (STS) tasks. In this approach, vector representations are created for all analyzed texts, after which similarity metrics between them are calculated. Text pairs with the highest similarity score are considered semantically closest.
Testing on a synthetic dataset showed that baseline SBERT models rank text similarity well. However, standard distance metrics (Euclidean, cosine similarity) do not effectively distinguish texts with opposite content, often classifying them in the same category as unrelated texts. This is a significant limitation for our task. Figure 1 shows classification accuracy by category for baseline SBERT models, illustrating this problem.
Fig. 1. Per-Category Classification Accuracy for basic SBERT
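The limitation illustrated in Figure 1 can be reproduced with a toy example. The sketch below substitutes a bag-of-words cosine similarity for real sentence embeddings (a deliberate simplification using only the standard library; the example strings are invented), but it exhibits the same failure mode: an opposite pair scores higher than an unrelated pair, so no single similarity threshold separates the two classes:

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts -- a toy stand-in for
    embedding similarity, used only to illustrate the failure mode."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# The negated (opposite) pair scores higher than the unrelated pair.
print(bow_cosine("submit form", "do not submit form"))   # opposite, yet high
print(bow_cosine("submit form", "read privacy policy"))  # unrelated, zero overlap
```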
Hugging Face Tasks and Model Selection for Fine-Tuning. To find a compromise between computational efficiency and result quality, we turned to the Hugging Face ecosystem, which provides ready-made pipelines for natural language processing tasks. The key advantage of this approach is that most pre-trained models are relatively small and can be run locally, significantly reducing their usage cost compared to the APIs of large models.

In our research, we considered several types of tasks available in Hugging Face:
– zero-shot-classification – a method that allows classifying texts by categories without seeing examples of these categories during training;
– text-classification – the traditional approach to assigning text to predefined categories;
– fill-mask – a task where the model fills in missing words in text, which can be used to evaluate semantic proximity;
– question-answering – the model answers questions based on context, which can potentially be adapted for text comparison;
– text-generation – creating new text that can be used for paraphrasing and subsequent comparison.
For these tasks, we tested a wide range of models of various architectures and sizes: “google/flan-t5-large”, “google/electra-large-discriminator”, “facebook/bart-large-mnli”, “roberta-large-mnli”, “l-yohai/bigbird-roberta-base-mnli”, “cross-encoder/nli-distilroberta-base”, “distilbert-base-uncased-finetuned-sst-2-english”, “bert-base-uncased”, “albert-base-v2”, “roberta-base”, “distilbert-base-cased-distilled-squad”, “deepset/roberta-base-squad2”, “EleutherAI/gpt-neo-125M”, “gpt2”, “distilgpt2”, and “facebook/opt-350m”. Testing was conducted taking into account the architectural features of each model and the specifics of the corresponding pipelines. Figure 2 presents a comparison of different models by size (in megabytes) and achieved accuracy on the synthetic dataset.
Fig. 2. Model Performance (Accuracy vs. Size)
Analysis showed that none of the tested combinations of ready-made models and pipelines fully met the task requirements: models either demonstrated insufficient accuracy or were too large for efficient local use. Therefore, a decision was made to fine-tune a model. For this purpose, google/electra-base-discriminator [16] was chosen in combination with the text-classification pipeline as the most promising in terms of balance between size, potential accuracy, and computational requirements.
Fine-tuning of the google/electra-base-discriminator model was conducted on a mixed dataset that included the English and Ukrainian synthetic samples. The dataset was augmented by swapping the texts in each pair, exploiting the symmetry of the similarity function. As a result, the training set contained 17,941 samples and the validation set 7,689 samples.
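The augmentation and fine-tuning steps above can be sketched as follows. This is a minimal sketch assuming the `transformers` and `datasets` libraries; the hyperparameters and output directory are illustrative assumptions, not the study's exact training configuration.

```python
from typing import List, Tuple

def augment_by_swapping(pairs: List[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Double the dataset by swapping the two texts in every pair:
    semantic similarity is symmetric, so the label carries over."""
    return pairs + [(b, a, label) for (a, b, label) in pairs]

def fine_tune(train_pairs: List[Tuple[str, str, int]],
              output_dir: str = "electra-wcag-253") -> None:
    """Fine-tune google/electra-base-discriminator for 3-class classification.
    Requires `transformers` and `datasets`; downloads the model on first use."""
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    name = "google/electra-base-discriminator"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    # Encode each (visible text, accessible name) pair as one sequence pair.
    data = Dataset.from_dict({
        "text_a": [a for a, _, _ in train_pairs],
        "text_b": [b for _, b, _ in train_pairs],
        "label":  [y for _, _, y in train_pairs],
    }).map(lambda row: tokenizer(row["text_a"], row["text_b"], truncation=True))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=3),
        train_dataset=data,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    trainer.save_model(output_dir)
```

Pairing the two texts in a single tokenized sequence lets the discriminator attend across both, which is what the 3-class head is trained to exploit.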
Results
As a result of fine-tuning the google/electra-base-discriminator model, we obtained a model with a size of 437 MB. On the validation set (drawn from the synthetic data), the model achieved the following metrics: F1-score = 0.96, Recall = 0.96. The results of model fine-tuning are shown in Figure 3.
Fig. 3. Loss (a), F1-Score (b), and Confusion Matrix (c)
The loss graph demonstrates a stable decrease in both training and validation loss to a level of ≈0.2–0.3 after 2,000–2,500 steps, indicating effective optimization. The validation F1-score increased to 0.96, reflecting high classification accuracy. The confusion matrix confirms per-class accuracy for “opposite” (95.1 %), “not related” (92.7 %), and “same/similar” (97.4 %).
Testing of the fine-tuned model on real
data from the Top500 and UaUniv datasets
showed acceptable accuracy. The results are
presented in Table 1.
Table 1.
Results of testing the fine-tuned model on real data

            Top500   UaUniv
Accuracy     0.76     0.72
Precision    0.79     0.74
Recall       0.76     0.72
F1-score     0.77     0.73
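For reference, metrics of this kind can be reproduced from a raw confusion matrix with macro averaging. The pure-Python sketch below shows the computation; the example matrices in the comments are made up for illustration and are not the study's data.

```python
from typing import List, Tuple

def macro_precision_recall_f1(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 from a confusion matrix,
    where cm[i][j] counts items of true class i predicted as class j."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return (sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)

# A perfect 2-class matrix yields (1.0, 1.0, 1.0):
# macro_precision_recall_f1([[5, 0], [0, 5]])
```

In practice the same numbers are obtained from scikit-learn's classification utilities; the sketch is only meant to make the averaging explicit.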
Discussion
The article addresses a fundamental contradiction in the comparison of visible text and text for assistive technologies, where attention is often focused solely on formal compliance, such as “the aria-label text begins with the text of the visible label”. Such compliance is easily verifiable algorithmically but fails to account for the semantic content of the texts. Basic methods, such as fuzzy string comparison (e.g., Levenshtein distance), are likewise incapable of considering semantic content. This makes them unsuitable for the task of evaluating semantic similarity, as they may ignore significant differences or incorrectly classify texts as similar when their content differs. As evidence, in the Top500 dataset of 382 samples, 287 (75 %) were identified by the LLM as semantically similar but formally violate the WCAG 2.5.3 criterion for algorithmic methods. Similarly, in the UaUniv dataset, 149 out of 320 samples (46 %) were classified by the LLM as similar but did not meet the criterion for basic methods. These results demonstrate that basic methods cannot provide adequate semantic analysis; therefore, comparison with them was not conducted in this study.
The research results demonstrate that micro- and nano-language models, in particular the fine-tuned google/electra-base-discriminator model, can be effectively used for automated verification of compliance with the WCAG 2.5.3 criterion. The transition from a detailed continuous scale of semantic similarity assessment from –1.0 to 1.0, which was used for data annotation by large language models (where, recall, 1.0 meant complete semantic correspondence and –1.0 meant opposition), to a more generalized 3-class classification (“similar”, “unrelated”, “opposite”) proved to be a successful approach for training SLMs. This allowed for high accuracy on synthetic data (F1 = 0.96).
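The collapse from the continuous annotation scale to the three training classes can be sketched as a simple thresholding step; the threshold values below are illustrative assumptions, not the values used during annotation.

```python
def score_to_class(score: float, low: float = -0.3, high: float = 0.5) -> str:
    """Map an LLM-annotated similarity score in [-1.0, 1.0] to one of the
    three classes. Thresholds are illustrative, not the study's values."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1.0, 1.0]")
    if score >= high:
        return "similar"
    if score <= low:
        return "opposite"
    return "unrelated"
```

Coarsening the target in this way trades annotation precision for a classification objective that a small model can learn reliably.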
The initial investigation of SBERT models confirmed their ability to rank similarity but also revealed limitations in clearly distinguishing semantically opposite texts using standard distance metrics. This highlighted the need for more specialized approaches, such as fine-tuning for a specific classification task.
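The SBERT probing described above can be sketched as follows; the model name is an illustrative assumption. The pure cosine helper also makes the limitation visible: opposite texts still share topic vocabulary, so their embeddings rarely point in opposite directions, and the cosine score alone cannot separate "opposite" from "similar" reliably.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sbert_similarity(
    text_a: str,
    text_b: str,
    model_name: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
) -> float:
    """Embed both texts with an SBERT model and compare the embeddings.
    Requires `sentence-transformers`; downloads the model on first use."""
    from sentence_transformers import SentenceTransformer  # deferred import
    model = SentenceTransformer(model_name)
    emb_a, emb_b = model.encode([text_a, text_b])
    return cosine_similarity(emb_a, emb_b)
```

A multilingual checkpoint is assumed here because the datasets mix English and Ukrainian samples.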
The performance of the fine-tuned model on the real datasets Top500 (F1 = 0.77) and UaUniv (F1 = 0.73) is somewhat lower than on synthetic data. This is expected, as real data often contains greater diversity and complexity of samples than synthetic data. However, the achieved indicators are still sufficiently high for practical application, especially considering the significantly lower computational resources required for SLMs compared to LLMs. Analysis of real websites showed that the vast majority of detected errors were related not so much to subtle semantic nuances between visible text and accessible name as to fundamentally incorrect markup. Such markup often made the use of assistive technologies extremely difficult or even impossible, rather than merely causing confusion due to semantic discrepancies. Among the unexpected patterns, it is worth highlighting the contextual sensitivity and a certain language independence of SLMs, both of which are positive aspects.
Research Limitations. The analysis of real data was limited to the Top500 and UaUniv datasets, which, although representative, do not cover the entire spectrum of websites. The effectiveness of SLMs largely depends on the quality of the fine-tuning data and on the fine-tuning process itself.
Despite the achieved results, it is important to remember that automated accessibility testing, even using advanced models similar to those proposed, is not a "silver bullet." It effectively identifies technical compliance with standards but cannot fully replace human verification for evaluating context, complex interactions, and overall user experience [3]. The best results are therefore achieved when automated methods are combined with manual expertise.
Practical Significance. The obtained results confirm that SLMs can serve as a foundation for developing new, more accessible, and efficient tools for automated web accessibility verification. This allows semantic analysis to be integrated into development processes without excessive resource expenditure.
The application of specialized AI tools developed by accessibility experts can significantly improve the testing process and expand its coverage [6]. Thus, the presented study of micro- and nano-language models contributes to this important field by offering a solution that simplifies accessibility testing and reduces barriers to its implementation.
Conclusion
The research demonstrated the effectiveness of using micro- and nano-language models for automating the verification of semantic compliance according to the WCAG 2.5.3 criterion “Label in Name”.
Key findings:
1. SBERT models are useful for obtaining vector representations of texts and for initial similarity ranking; however, standard metrics do not reliably distinguish semantically opposite texts.
2. Fine-tuning a relatively small model (google/electra-base-discriminator, 110M parameters) for the task of 3-class classification (“similar”, “unrelated”, “opposite”) achieved high accuracy (F1 = 0.96) on synthetic data and sufficient accuracy (F1 up to 0.77) on real data (Top500, UaUniv).
3. SLMs are significantly more compact and less resource-intensive than LLMs, making them suitable for local deployment and integration into various development tools.
4. The developed approach offers a practical solution for improving automated accessibility testing, complementing existing tools with semantic analysis capabilities.
Despite the achieved results, it is important to remember the limitations of SLMs and the necessity of human verification in complex cases. Further research may be directed toward expanding the training datasets, investigating other SLM architectures and fine-tuning methods, and integrating the developed models into comprehensive accessibility testing systems. This will contribute to creating a more accessible web environment for all users.
To ensure the reproducibility of our research, we publish our artifacts on Kaggle [17].
References
1. Web Content Accessibility Guidelines
(WCAG) 2.1 [Electronic resource]. 2025.
URL: https://www.w3.org/TR/WCAG21/
(accessed: 01.06.2025).
2. Suarez C. Comprehensive Guide to Automated Accessibility Testing [Electronic resource]. 2024. URL: https://kobiton.com/blog/comprehensive-guide-to-automated-accessibility-testing/ (accessed: 01.06.2025).
3. Prasad M. Automated Accessibility Testing Is Not a Silver Bullet [Electronic resource]. DigitalA11Y, 2025. URL: https://www.digitala11y.com/automated-accessibility-testing-is-not-a-silver-bullet/ (accessed: 01.06.2025).
4. Wieland R. Limitations of an Automated-Only
Web Accessibility Plan [Electronic resource].
2024. URL:
https://allyant.com/blog/limitations-of-an-
automated-only-web-accessibility-plan/
(accessed: 01.06.2025).
5. Intelligence Community Design System. Limitations of automated testing [Electronic resource]. URL: https://design.sis.gov.uk/accessibility/testing/automated-testing-limitation (accessed: 01.06.2025).
6. Barrell N. Enhancing Accessibility with AI and
ML [Electronic resource]. 2023. URL:
https://www.deque.com/blog/enhancing-
accessibility-with-ai-and-ml/ (accessed:
01.06.2025).
7. Du N. et al. GLaM: Efficient Scaling of
Language Models with Mixture-of-Experts //
Proc Mach Learn Res. ML Research Press,
2021. Vol. 162. P. 5547–5569.
8. Achary S. The Rise of Small Language
Models: A New Era of AI Accessibility and
Efficiency [Electronic resource]. 2024. URL:
https://medium.com/small-language-
models/the-rise-of-small-language-models-a-
new-era-of-ai-accessibility-and-efficiency-
752322d82656 (accessed: 01.06.2025).
9. Aralimatti R. et al. Fine-Tuning Small
Language Models for Domain-Specific AI: An
Edge AI Perspective. Preprints, 2025.
10. Abdin M. et al. Phi-3 Technical Report: A
Highly Capable Language Model Locally on
Your Phone. 2024.
11. Moz. Top 500 Most Popular Websites
[Electronic resource]. URL:
https://moz.com/top500 (accessed:
01.06.2025).
12. Tytov P.O., Shovkoplias O.A., Kuzikov B.O. Analysis of the web accessibility of Ukrainian higher education institution websites // System Research and Information Technologies. 2025. No. 2. (in Ukrainian)
13. Gu J. et al. A Survey on LLM-as-a-Judge. 2024.
Vol. 1.
14. Reimers N., Gurevych I. Making Monolingual
Sentence Embeddings Multilingual using
Knowledge Distillation // EMNLP 2020 - 2020
Conference on Empirical Methods in Natural
Language Processing, Proceedings of the
Conference. Association for Computational
Linguistics (ACL), 2020. P. 4512–4525.
15. Muennighoff N. et al. MTEB: Massive Text
Embedding Benchmark // EACL 2023 - 17th
Conference of the European Chapter of the
Association for Computational Linguistics,
Proceedings of the Conference. Association for
Computational Linguistics (ACL), 2022. P.
2006–2029.
16. Clark K. et al. ELECTRA: Pre-training Text
Encoders as Discriminators Rather Than
Generators // 8th International Conference on
Learning Representations, ICLR 2020.
International Conference on Learning
Representations, ICLR, 2020.
17. Tytov P. Semantic Language Models for WCAG [Electronic resource]. URL: https://www.kaggle.com/datasets/tytovpavel/semantic-language-models-for-wcag (accessed: 01.06.2025).
Received: 06.06.2025
Internal review received: 13.06.2025
External review received: 16.06.2025

About the authors:

Kuzikov Borys Olehovych,
Candidate of Technical Sciences (PhD), Associate Professor
https://orcid.org/0000-0002-9511-5665
b.kuzikov@cs.sumdu.edu.ua

Shovkoplias Oksana Anatoliivna,
Candidate of Physical and Mathematical Sciences (PhD), Associate Professor
https://orcid.org/0000-0002-4596-2524
o.shovkoplyas@mss.sumdu.edu.ua

Tytov Pavlo Olehovych,
PhD candidate
https://orcid.org/0009-0003-6911-5463
stegaspasha@gmail.com

Shovkoplias Serhii Rostyslavovych,
PhD candidate
https://orcid.org/0000-0003-1837-0213
s.shovkoplyas@student.sumdu.edu.ua

Authors' place of work and study:
Sumy State University,
Department of Computer Science,
40007, Sumy, 116 Kharkivska St.,
tel. +38(0542)687776