Development of a method for ensuring the quality of comments in version control systems based on the Word2Vec, FastText, and GloVe models

This paper substantiates the relevance of addressing the problem of ensuring the quality of change descriptions in source code files within version control systems. To filter commit messages, machine learning methods are employed, including neural networks of various architectures. The use of neural...

Bibliographic Details
Date:2026
Main Authors: Semonov, Bohdan, Pogorilyy, Sergiy
Format: Article
Language:English
Published: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2026
Subjects:
Online Access:https://journal.iasa.kpi.ua/article/view/365243
Journal Title:System research and information technologies

Institution

System research and information technologies
_version_ 1869472195860758528
author Semonov, Bohdan
Pogorilyy, Sergiy
author_facet Semonov, Bohdan
Pogorilyy, Sergiy
author_institution_txt_mv [ { "author": "Bohdan Semonov", "institution": "Taras Shevchenko National University of Kyiv, Kyiv" }, { "author": "Sergiy Pogorilyy", "institution": "Taras Shevchenko National University of Kyiv, Kyiv" } ]
author_sort Semonov, Bohdan
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2026-06-30T06:14:59Z
description This paper substantiates the relevance of addressing the problem of ensuring the quality of change descriptions in source code files within version control systems. To filter commit messages, machine learning methods are employed, including neural networks of various architectures. The use of neural networks is justified by the need to identify descriptions that accurately reflect the intent of the changes. A comparative analysis of word embedding methods (Word2Vec, FastText, and GloVe) was conducted, along with their application in binary classifiers such as MLP and RNN for filtering code changes. The models were trained on a dataset of change descriptions collected via the GitHub REST API. Model performance was evaluated using Accuracy and F1-score metrics. The effectiveness of the Google Colab environment for prototyping machine learning models was also confirmed.
doi_str_mv 10.20535/SRIT.2308-8893.2026.2.01
first_indexed 2026-07-01T01:00:18Z
format Article
fulltext © B. O. Semonov, S. D. Pogorilyy, 2026. Системні дослідження та інформаційні технології, 2026, № 2

THEORETICAL AND APPLIED PROBLEMS OF INFORMATICS

UDC 004.05:004.85
DOI: 10.20535/SRIT.2308-8893.2026.2.01

DESIGN AND EVALUATION OF A QUALITY ASSURANCE METHOD FOR COMMIT MESSAGES IN VERSION CONTROL SYSTEMS USING WORD2VEC, FASTTEXT, AND GLOVE EMBEDDINGS

B.O. SEMONOV, S.D. POGORILYY

Abstract. This paper substantiates the relevance of addressing the problem of ensuring the quality of change descriptions in source code files within version control systems. To filter commit messages, machine learning methods are employed, including neural networks of various architectures. The use of neural networks is justified by the need to identify descriptions that accurately reflect the intent of the changes. A comparative analysis of word embedding methods (Word2Vec, FastText, and GloVe) was conducted, along with their application in binary classifiers such as MLP and RNN for filtering code changes. The models were trained on a dataset of change descriptions collected via the GitHub REST API. Model performance was evaluated using Accuracy and F1-score metrics. The effectiveness of the Google Colab environment for prototyping machine learning models was also confirmed.

Keywords: AdamW algorithm, commit message, GitHub REST API, GRU layer, MLP (Multilayer Perceptron), RNN (Recurrent Neural Network), source code, software, change description, repository, harmonic mean, version control system.

INTRODUCTION

In the digital era, where software development plays a crucial role across various industries, version control systems have become an essential tool for managing source code. The dynamic nature of the market demands not only speed and quality from developers but also a structured and reliable approach to version management [1].
Version control systems enable developers to work on projects flexibly, implement changes, and test new features while retaining the ability to revert to previous versions when necessary. This helps prevent data loss, maintain system stability, and improve team collaboration. Recording changes in a version control system plays a key role in understanding, tracking, and maintaining the history of modifications to the codebase. Each commit message reflects a specific edit made by a developer. A detailed description of changes ensures that both current and future team members can understand what adjustments were made and for what purpose [2]. In summary, detailed change descriptions improve project documentation, simplify technical support, facilitate the tracking of development stages and decision-making logic, and contribute to greater project transparency.

B. O. Semonov, S. D. Pogorilyy. ISSN 1681–6048 System Research & Information Technologies, 2026, № 2

The development of a method for ensuring the quality of comments on source code changes in version control systems is becoming increasingly important due to the growing scale of projects and the increasing complexity of codebase structures. Since modern projects may involve developers with varying levels of experience and different approaches to programming, a change description filter that evaluates the content and context of comments will help other team members understand the changes and will simplify technical support [3]. Accordingly, a high-quality change description is a message whose content adheres to the general rules for writing commit messages, is concise, and describes the nature of the changes, their effect, and the reason for the modifications.

1. Rules for formatting commit messages [4]:
a) the description should include a header and may also contain a body; these sections must be separated by a blank line;
b) the header should be concise, typically limited to 50–70 characters in length;
c) verbs in the header should be used in the imperative form.
2. Rules for presenting information in a commit message:
a) the message should explain the reason why the changes were made;
b) the message should indicate the effect of the changes;
c) if the changes address a specific issue, it is necessary to mention it;
d) the information should be presented in a way that allows a specialist, who may not be familiar with the issue or the code structure, to understand what has been done and what impact it has on the project (software).

Examples of high-quality descriptions are provided in Table 1.

Table 1. Commit messages that comply with rule 2.d

| No. | Example of description |
|---|---|
| 1 | Add array identifiers for generating routes. This allows for resources with multi-column keys to generate the related routes by only passing an instance of the object in question |
| 2 | Increases the select timeout in start reactor() endless loop. This small patch greatly reduces CPU time (for instance for backgroundrb) |
| 3 | Add a “variance” method and a “sum” method to the Array class to stay DRY |
| 4 | Fixed autocomplete with scrollbar. IE has problems when the ul has a scrollbar and the user clicks there. The activate event is thrown and the list disappears |

PROBLEM STATEMENT

All the rules for writing commit messages presented in the previous section, except rule 2.d, can be checked programmatically in any programming language. Rule 2.d, however, requires machine learning, as there are no formal rules for determining what has been done and why. This requires “human” understanding of the context and content of the commit message.
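The formal rules 1.a and 1.b are simple enough to check with a few lines of code. The following is a minimal sketch, not the authors' implementation: the function name `check_format_rules` and the default limit of 70 characters (the upper bound of the range quoted in rule 1.b) are assumptions made for illustration. Rule 1.c (imperative mood) and rule 2.d are deliberately left out, since they need linguistic analysis rather than string checks.

```python
def check_format_rules(message: str, max_header_len: int = 70) -> list[str]:
    """Return a list of violated formatting rules (empty if none were found).

    Hypothetical helper: checks only rule 1.a (blank line between header
    and body) and rule 1.b (header length).
    """
    lines = message.splitlines()
    header = lines[0] if lines else ""
    problems = []

    if not header.strip():
        problems.append("header is empty")
    if len(header) > max_header_len:
        problems.append(f"header is longer than {max_header_len} characters")
    # Rule 1.a: if anything follows the header, the second line must be blank.
    if len(lines) > 1 and lines[1].strip():
        problems.append("header and body are not separated by a blank line")
    return problems
```

A message that passes these checks can still fail rule 2.d, which is why the filter described in the rest of the paper is needed.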
Hence, the problem statement arises: to develop a method for ensuring the quality of commit messages in version control systems, which will analyze the description of changes made by the developer to the source code and return a label indicating whether the comment answers the question “what was done and why?” – i.e., “yes” or “no”.

Formalization of the filter problem. Let $\mathbf{X}$ be the set of commit descriptions (the universal set), where each commit message is represented by an m-dimensional word vector:

$$\vec{x} = (x_1, x_2, \ldots, x_m), \quad \vec{x} \in \mathbf{X}, \qquad (1)$$

where m is the dimensionality of the word vector. $\mathbf{Y}$ is the set of possible responses (labels) in the form of an M-dimensional vector:

$$\vec{y} = (y_1, \ldots, y_M), \quad \vec{y} \in \mathbf{Y}, \qquad (2)$$

where M = 2 is the number of possible responses (classes) to be obtained: the first response indicates that the current commit message meets the requirements, and the second response indicates that it does not. Hence, the sought filter solves a binary classification problem: $\mathbf{Y} = \{0, 1\}$ [5]. Thus, the commit message filtering model can be described as a surjective, but not injective, mapping:

$$f: \mathbf{X} \to \mathbf{Y}. \qquad (3)$$

COLLECTION AND PREPARATION OF TRAINING CORPORA FOR COMMIT MESSAGES

For data collection, GitHub, one of the largest web services for hosting IT projects and collaborative software development, was used. This platform is based on the well-known version control system Git. GitHub provides a special application programming interface (API) called the GitHub REST API [6]. Since the research is focused on verifying the content of commit messages related to changes (differences, abbreviated as “diffs”) for version control systems, the following interfaces were required:
1) obtaining a list of repositories (projects);
2) obtaining commit messages for each repository.
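The two interfaces can be exercised with a small amount of code. The sketch below only builds the request URLs and headers and extracts the fields of interest from a decoded JSON response; it does not send anything over the network. The function names are hypothetical, the header values and endpoint paths are those given in the GitHub REST API documentation [6], and the token is a placeholder the reader must supply.

```python
from typing import Optional
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def build_headers(token: str) -> dict:
    """HTTP headers shared by both requests."""
    return {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }


def repositories_url(since: Optional[int] = None) -> str:
    """URL for listing repositories; `since` is an optional repository id."""
    query = f"?{urlencode({'since': since})}" if since is not None else ""
    return f"{API_ROOT}/repositories{query}"


def commits_url(full_name: str, per_page: int = 100,
                sha: Optional[str] = None) -> str:
    """URL for listing the commits of the repository `full_name`."""
    params = {"per_page": per_page}
    if sha is not None:
        params["sha"] = sha  # commit hash from which listing should begin
    return f"{API_ROOT}/repos/{full_name}/commits?{urlencode(params)}"


def extract_commit_fields(item: dict) -> dict:
    """Pick the fields used for the corpus from one element of the response."""
    return {
        "sha": item["sha"],
        "message": item["commit"]["message"],
        "date": item["commit"]["author"]["date"],
        "parents": [parent["sha"] for parent in item.get("parents", [])],
    }
```

Any HTTP client can then issue a GET request with the resulting URL and headers; keeping URL construction separate from the network call makes the collection script easier to test.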
The aforementioned services are invoked with HTTP GET requests, which must carry the corresponding GET parameters and HTTP headers. The response is returned as structured JSON text. The HTTP requests listed in Table 2 share an identical set of HTTP headers, namely:
1. 'Accept': 'application/vnd.github+json' – the response type from the GitHub REST API;
2. 'Authorization': 'Bearer [token]' – user authorization for the API, where token is a special set of characters that can be generated in the user's settings on the GitHub web service;
3. 'X-GitHub-Api-Version': '2022-11-28' – the version of the GitHub REST API.

Table 2. Features of the GitHub REST API services

| No. | Service Name | URL Link | GET Parameters | “Useful” Fields for Research |
|---|---|---|---|---|
| 1 | Obtaining a list of repositories | https://api.github.com/repositories | since – an optional integer parameter corresponding to a repository identifier. If it is present, the response returns the repositories whose identifiers are greater than the specified one | id – the repository identifier; full_name – the repository name; description – a detailed description of the repository |
| 2 | Obtaining commit messages for each repository | https://api.github.com/repos/[full_name]/commits?per_page=100, where full_name is the repository name from item No. 1 | per_page – the number of commit messages returned per request; optional, 30 by default. sha – the hash of the commit from which the search should begin | sha – the hash of the commit; message – the commit message, located in the “commit” JSON object; date – the date and time when the changes were made to the repository, located in the “author” JSON object, which is in turn contained within the “commit” JSON object; sha from the “parents” JSON array – the hash of the commit for previous changes |

After the data was downloaded, manual labeling was performed: each commit message was assigned to the corresponding class according to rule 2.d. Since this process is time-consuming, this study uses 8,000 of the 1 million downloaded messages. Analysis of the labeled commit messages showed that the data is imbalanced: in the obtained sample, the descriptions that do not comply with rule 2.d outnumber those that meet the requirements. Fig. 1 shows the distribution of the training corpora of commit messages.

Fig. 1. Distribution of commit messages

Before feeding the commit messages into the neural networks under study, the following steps were performed:
1) tokenization of the messages in the training dataset [7];
2) vectorization of the messages.

When the message vectorization object was created, the length of the input token sequence was fixed, as the input tensor for the neural network must have a consistent shape. Additionally, the vocabulary size was limited, thereby discarding the least frequently used tokens to accelerate computation [8].

METHOD FOR ENSURING THE QUALITY OF COMMIT MESSAGES IN SOURCE CODE FOR VERSION CONTROL SYSTEMS

The first filter (Fig. 2) is based on a “classical” neural network, namely a multilayer perceptron (MLP). This model consists of:
1. three weighted layers:
a) an input (Embedding) layer;
b) a hidden fully connected (Dense) layer, which contains a total of 48 neurons;
c) an output Dense layer [9] with a single neuron, which returns the desired response (M = 2);
2. an intermediate layer without weights (Global Average Pooling), which “condenses” the entire input sequence into a single fixed-length vector for further transmission to the Dense layer [9].

The AdamW optimization algorithm [10], a variant of the Adam algorithm with decoupled weight decay regularization, was chosen. The selected algorithm demonstrates faster training and better generalization. Its main steps are:
1. parameter initialization:
a) setting the initial values of the model parameters $\theta_0$ (weight vector) randomly;
b) initialization of the first moment (estimate of the mean gradient) $m_0 = 0$ and the second moment (estimate of the gradient variance) $v_0 = 0$;
c) setting the hyperparameters: $\eta$ – learning rate; $\beta_1$ and $\beta_2$ – decay coefficients for the first and second moments; $\epsilon$ – a small value to prevent division by zero; $\lambda$ – weight decay coefficient;
2. gradient computation: in each iteration $t$, the gradient of the loss function is calculated:

$$g_t = \nabla_\theta f(\theta_{t-1}), \qquad (4)$$

where $f(\theta)$ is the loss function; $g_t$ – the gradient for the current parameters;
3. moment updates:
a) update of the first moment (estimate of the mean gradient):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t; \qquad (5)$$

b) update of the second moment (estimate of the gradient’s variance):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2; \qquad (6)$$

4. correction of moment bias: to compensate for the bias at the initial stages (since $m_0 = 0$ and $v_0 = 0$), a correction is applied:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad (7)$$

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}; \qquad (8)$$

5. parameter update considering weight decay and adaptive learning rate:

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right), \qquad (9)$$

where the weight decay term $\lambda \theta_{t-1}$ is applied separately from the main gradient update, which distinguishes AdamW from the classical Adam optimizer;
6. steps 2–5 are repeated for each iteration $t$ until a stopping condition is met (depending on the number of epochs or predefined stopping criteria).
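Update rules (4)–(9) can be traced step by step in plain Python. The sketch below is an illustration under stated assumptions rather than the optimizer used in the study (which relied on the TensorFlow implementation): the function name `adamw_step` and the default hyperparameter values are chosen here for demonstration only, and the parameters are held in ordinary lists.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW iteration over lists of floats; `t` starts at 1.

    Returns the updated (theta, m, v). Default values are illustrative.
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, m_prev, v_prev in zip(theta, grad, m, v):
        m_t = beta1 * m_prev + (1 - beta1) * g        # eq. (5)
        v_t = beta2 * v_prev + (1 - beta2) * g * g    # eq. (6)
        m_hat = m_t / (1 - beta1 ** t)                # eq. (7)
        v_hat = v_t / (1 - beta2 ** t)                # eq. (8)
        # eq. (9): the decay term is added outside the adaptive ratio
        th = th - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * th)
        new_theta.append(th)
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v


def minimize_square(steps=50, start=1.0):
    """Toy run on f(theta) = theta^2, whose gradient is 2 * theta (eq. 4)."""
    theta, m, v = [start], [0.0], [0.0]
    for t in range(1, steps + 1):
        grad = [2.0 * theta[0]]
        theta, m, v = adamw_step(theta, grad, m, v, t)
    return theta[0]
```

Because the bias-corrected ratio in (9) is close to 1 while the gradient keeps its sign, each early step moves the parameter by roughly the learning rate, independent of the gradient's scale; the weight decay term adds a small pull toward zero on top of that.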
To mitigate the effect of overfitting, regularization techniques were employed:
1. an additional Dropout layer with a rate of 0.3 was introduced. It randomly deactivates neurons in the layer with a specified probability p during each training iteration in order to prevent co-adaptation of neuron weights, which would otherwise lead to overfitting [11];
2. the early stopping technique was additionally implemented.

Fig. 2. Architecture of the “classical” MLP model with regularization techniques

For the construction of the second filter model (Fig. 3), a single bidirectional GRU layer with 24 neurons per direction was used (48 neurons in total, as in the first filter model).

Fig. 3. Architecture of the model with a single recurrent BiGRU layer and regularization techniques

It should be noted that the use of the aforementioned models is based on the results obtained in the work “The Implementation of a Commit Messages Filter for Software Version Control Systems” [5].

The Embedding input layer, present in both models, serves as a dictionary mapping: it transforms discrete word indices in the text into dense vector representations. In fact, it is a weight matrix that is either learned during model training or initialized with pre-trained word vectors, such as Word2Vec [12], FastText [13], or GloVe [14], which are applied in this work. Their use is justified by the fact that the commit message filter requires high-quality word vector representations obtained from a large corpus, which improve the generalization of the neural network. Table 3 presents a comparative analysis of methods for creating word embeddings.

Table 3. Comparative analysis of word embedding methods

| Characteristic | Word2Vec | FastText | GloVe |
|---|---|---|---|
| Model type | Neural network (CBOW, Skip-gram) | Neural network (CBOW, Skip-gram + n-grams) | Statistical (matrix factorization) |
| Operating principle | Trained to predict a word from its context or vice versa | Expands on Word2Vec by adding subword n-grams | Analyzes statistical co-occurrence of words |
| Context type | Local (word window) | Local (word window) | Global (co-occurrence) |
| Morphological awareness | No | Yes (splitting words into subsequences) | No |
| Support for rare words | Insufficient (no vectors for new words) | Sufficient (creates vectors for unseen words) | Insufficient (no vectors for new words) |
| Vocabulary size | Fixed (limited to words present in the corpus) | Dynamic (generates vectors even for unseen words) | Fixed |
| Performance on small corpora | Sufficient | Excellent (due to n-grams) | Insufficient (requires a large amount of data) |
| Computational complexity | Average | Higher than in Word2Vec | High (large-size matrices) |
| Usage | Classical NLP tasks, medium-sized corpora | Morphologically rich languages, rare words | Large corpora, global word analysis |
| Main limitation | Does not handle unseen words | Slow performance | Requires large resources |

Key components of the Embedding layer:
1. input layer (input indices) – receives integers representing word indices in the vocabulary;
2. weight matrix – a table of size (vocab_size, embedding_dim), where:
a) vocab_size – the size of the vocabulary (i.e., the number of unique words);
b) embedding_dim – the dimensionality of the word embedding (e.g., 100, 200, or 300);
3. embedded word vectors – for each input index, the corresponding row from the weight matrix is returned.

The work utilizes six pre-trained models for the Embedding layer. The key features of these models are presented in Table 4.

Table 4. Characteristics of the applied models

| No. | Model | Vectorization Type | Vector Size | Training Corpus | Key Features | Support for Unseen Words | Library |
|---|---|---|---|---|---|---|---|
| 1 | word2vec-google-news-300 | Word2Vec (CBOW, Skip-gram) | 300 | Google News (~100 billion words) | General-purpose model for standard English | No | Gensim [12] |
| 2 | SO_vectors_200 | Word2Vec | 200 | Stack Overflow (technical texts) | Specialized for technical texts and programming [15] | No | Zenodo [16] |
| 3 | fasttext-wiki-news-subwords-300 | FastText (CBOW, Skip-gram + n-grams) | 300 | Wikipedia | Accounts for morphology, generates vectors for unseen words | Yes | Gensim |
| 4 | glove-twitter-200 | GloVe | 200 | Twitter (~2 billion words) | Oriented toward social media and slang | No | Gensim |
| 5 | glove-wiki-gigaword-50 | GloVe | 50 | Wikipedia + Gigaword | Compact and fast, lower quality for complex tasks | No | Gensim |
| 6 | glove-wiki-gigaword-300 | GloVe | 300 | Wikipedia + Gigaword | Balances size and quality, suitable for NLP tasks [17] | No | Gensim |

In addition, a methodology based on a validation subset was used for model training. This involves dividing the entire available labeled dataset into non-overlapping parts: 20% of the training set was allocated to the validation subset, while the remaining portion was used for training the candidate models [18].

The models mentioned above were implemented in the Python programming language (version 3.11.11) with the TensorFlow library (version 2.18.0). Table 5 presents a comparative analysis of TensorFlow and other popular machine learning libraries, such as PyTorch [19], Keras [20], and Scikit-learn [21]. Each of these libraries has its unique features and is suitable for different types of tasks.

Google Colab [22] was selected as the execution environment because it provides cloud-based GPU (GPGPU) and TPU hardware resources. Additionally, it supports all the aforementioned machine learning libraries, making it a convenient platform for researchers and developers.
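The path from a raw message to the input of the first Dense layer (vectorization with a fixed sequence length and a limited vocabulary, row lookup in the Embedding weight matrix, then Global Average Pooling) can be followed in a few lines of plain Python. This is a minimal sketch and not the TensorFlow code used in the study: the function names, the whitespace tokenizer, and the reserved indices 0 (padding) and 1 (unknown token) are assumptions made for illustration.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (an assumption of this sketch)


def build_vocab(messages, max_tokens):
    """Keep only the most frequent tokens; indices 0 and 1 are reserved."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(kept, start=2)}


def vectorize(message, vocab, seq_len):
    """Map tokens to indices, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, UNK) for tok in message.lower().split()][:seq_len]
    return ids + [PAD] * (seq_len - len(ids))


def embed(ids, weights):
    """Embedding lookup: return the row of the weight matrix for each index."""
    return [weights[i] for i in ids]


def global_average_pooling(vectors):
    """Average over the sequence axis, giving one fixed-length vector."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]
```

In this sketch the padded positions take part in the average, so shorter messages are pulled toward the padding vector; a masked average would be the alternative if that effect is undesirable. With pre-trained embeddings, the rows of `weights` would be filled from the vectors of one of the models in Table 4 instead of being learned.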
Table 5. Features of machine learning libraries

| Characteristic | TensorFlow | PyTorch | Keras | Scikit-learn |
|---|---|---|---|---|
| Year of release | 2015 | 2016 | 2015 | 2007 |
| Developer | Google | Facebook (Meta) | Google (TensorFlow submodule), François Chollet | Institute |
| Programming language | Python, C++, Java, Go, JavaScript, Swift | Python, C++ | Python | Python, C++ |
| Main use | Deep learning, neural networks | Deep learning, neural networks | Interface for TensorFlow/high-level API | Classical ML algorithms, regression, classification |
| CPU/GPU support | Yes (supports GPU through CUDA, TPU) | Yes (GPU through CUDA) | Yes (through TensorFlow or Theano) | Only CPU (GPU supported through wrappers) |
| Modularity | High (TensorFlow 2.0 simplified) | High (dynamic computational graphs) | High (user-friendly API on top of TensorFlow) | Medium (for traditional ML models) |
| Dynamic graph | No (but there is the Eager Execution feature in TF 2.0) | Yes (based on PyTorch) | No (relies on TensorFlow) | No |
| Ease of use | Relatively complex, but improving with TF 2.0 | Simple for prototyping, intuitive | Very simple (high level of abstraction) | Very simple (for classical ML tasks) |
| Flexibility | High (low level of control) | High (simple control over computations) | Limited (high level of abstraction) | Medium (focused on classical algorithms) |
| Computational graphs | Static (but there is dynamic capability) | Dynamic (created during execution) | Static (through TensorFlow or Theano) | No |
| Mobile device support | Yes (TensorFlow Lite) | No | Yes (through TensorFlow Lite) | No |
| Model export | TensorFlow SavedModel, HDF5, TF.js, TF Lite | TorchScript, ONNX | HDF5, TensorFlow SavedModel | Pickle, ONNX |
| Community and resources | Large, extensive official documentation, support from Google | Large, active community, many examples | Large community through TensorFlow/Keras | Very large for classical ML |
| Industry usage | Widely used, especially in large projects (Google, Uber, Airbnb) | Used for research and prototyping | Widely used for rapid development | Very popular in academic research and startups |

The use of the Google Colab platform revealed the following advantages:
1) access to powerful GPUs and TPUs. Graphics and tensor processing units significantly accelerate the training of deep learning models. Google Colab provides access to these computational resources, which is crucial for projects with high computational demands.
This allows researchers to train resource-intensive models without the need to invest in expensive hardware;
2) the cloud-based environment of Google Colab operates directly in a web browser, eliminating the need for users to install any software on their local machines. All computations are performed on Google’s servers, while data and written scripts are automatically saved to Google Drive. This facilitates project storage and ensures convenient access from any device;
3) integration with GitHub: this feature allows users to open, edit, and save source code files directly within GitHub repositories;
4) support for major machine learning libraries: Google Colab comes pre-installed with popular libraries such as TensorFlow, Keras, PyTorch, Scikit-learn, Pandas, NumPy, among others. This enables users to start working immediately without spending time on package installation;
5) interactive Python environment: Colab utilizes Jupyter Notebook in the cloud, allowing users to write Python scripts, execute code in separate cells, and add visualizations, text blocks, and graphs. This interactive environment is ideal for data experimentation, machine learning model development, scientific research, and educational purposes.

EVALUATION OF THE METHOD FOR ENSURING COMMENT QUALITY

The following metrics were used to evaluate the quality of the commit message filtering [21]:
1) Accuracy (proportion of correct responses):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (10)$$

where TP – true positive responses; TN – true negative responses; FP – false positive responses; FN – false negative responses.
2) Precision – the proportion of true positive responses among all positive responses of the classifier:

$$Precision = \frac{TP}{TP + FP}; \qquad (11)$$

3) Recall – the proportion of true positive responses on positive objects:

$$Recall = \frac{TP}{TP + FN}; \qquad (12)$$

4) F1-score – the harmonic mean:

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. \qquad (13)$$

However, it should be noted that for the commit message filter, recall is of greater importance, as it is preferable to reject a message and suggest that the developer either supplement or rephrase it, rather than failing to identify a relevant message.

The results of the developed models are presented in Table 6.

Table 6. Comparison of the performance of developed models for implementing the commit message filter

| No. | Embedding Layer Model | Metric | MLP-Based Filter | RNN-Based Filter |
|---|---|---|---|---|
| 1 | word2vec-google-news-300 | Accuracy | 0.856 | 0.855 |
| | | F1-score | 0.624 | 0.626 |
| 2 | SO_vectors_200 | Accuracy | 0.861 | 0.857 |
| | | F1-score | 0.687 | 0.688 |
| 3 | fasttext-wiki-news-subwords-300 | Accuracy | 0.859 | 0.858 |
| | | F1-score | 0.701 | 0.694 |
| 4 | glove-twitter-200 | Accuracy | 0.838 | 0.839 |
| | | F1-score | 0.579 | 0.599 |
| 5 | glove-wiki-gigaword-50 | Accuracy | 0.841 | 0.847 |
| | | F1-score | 0.601 | 0.606 |
| 6 | glove-wiki-gigaword-300 | Accuracy | 0.840 | 0.842 |
| | | F1-score | 0.616 | 0.615 |

CONCLUSIONS

This work proposes various approaches for implementing a commit message filter. This filter is one of the stages in the analysis, processing, and preparation of data for a method that will be capable of generating commit messages based on this data. A comparative analysis of the main methods for generating word embeddings, such as Word2Vec, FastText, and GloVe, has been conducted. Additionally, these methods were applied in commit message filters built on MLP and RNN binary classifiers. An evaluation of the constructed commit message quality assurance filters has been performed.
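The metrics behind that evaluation, formulas (10)–(13), reduce to a few arithmetic operations on the confusion-matrix counts. The sketch below is illustrative only: the function name is hypothetical and the counts used in the usage note are invented for demonstration, not taken from the study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0                  # (10)
    precision = tp / (tp + fp) if (tp + fp) else 0.0                # (11)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                   # (12)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0           # (13)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For a hypothetical imbalanced sample with `tp=5, tn=85, fp=5, fn=5`, accuracy is 0.9 while the F1-score is only 0.5. This is why, on imbalanced data such as the labeled corpus used here, accuracy alone can look favorable and the F1-score is reported alongside it.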
The results table shows that the best performance was achieved by the “classic” MLP network combined with either the pre-trained technical-text model SO_vectors_200 (accuracy 86.1%) or the fasttext-wiki-news-subwords-300 model (85.9%); the latter accounts for morphology and generates vectors for unknown words. The Google Colab environment proved well suited to rapid prototyping and development of machine learning models: its GPU and TPU support and its integration with cloud services make it a convenient tool for students, researchers, and software engineers.
Received 24.02.2025
INFORMATION ON THE ARTICLE
Bohdan O. Semonov, ORCID: 0009-0001-3692-9415, Taras Shevchenko National University of Kyiv, Ukraine, e-mail: bohdan.semonov@gmail.com
Sergiy D. Pogorilyy, ORCID: 0000-0002-6497-5056, Taras Shevchenko National University of Kyiv, Ukraine, e-mail: sdp7799@gmail.com
СТВОРЕННЯ МЕТОДУ ЗАБЕЗПЕЧЕННЯ ЯКОСТІ КОМЕНТАРІВ У СИСТЕМАХ КОНТРОЛЮ ВЕРСІЙ НА ОСНОВІ МОДЕЛЕЙ WORD2VEC, FASTTEXT ТА GLOVE / B.O. Semonov, S.D. Pogorilyy
Abstract. The relevance of ensuring the quality of descriptions of changes made to program source code in version control systems is substantiated. Comments are filtered using machine learning methods, namely neural networks of various architectures. Neural networks are appropriate because the task requires finding change descriptions that reflect the purpose of the changes. A comparative analysis is carried out of methods for building word vector representations (Word2Vec, FastText, and GloVe) and of their use in the MLP and RNN binary classifiers for filtering changes. The models are trained on a set of change descriptions obtained through the GitHub REST API. Model quality is evaluated with two metrics: accuracy and the harmonic mean (F1-score). The effectiveness of the Google Colab environment for prototyping machine learning models is also confirmed.
Keywords: AdamW algorithm, commit message, GitHub REST API, GRU layer, MLP, RNN network, multilayer perceptron, program source code, change description, software, recurrent neural networks, repository, harmonic mean, version control system.
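The two evaluation metrics named in the abstract, accuracy and the F1-score (a harmonic mean), can be made concrete with a short sketch. The Python below is illustrative only: the function name `accuracy_and_f1` and the toy labels are assumptions introduced here, and this record does not show the article's own evaluation code. Accuracy is computed as the share of correct predictions, and F1 as the harmonic mean of precision and recall for the positive class of a binary classifier.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, where 1 is the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical toy labels (assumption): 1 = acceptable commit message, 0 = rejected.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

Reporting F1 alongside accuracy is useful when the two classes are unevenly represented, since a classifier that rarely predicts the positive class can still score a high accuracy while its F1 stays low.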
id journaliasakpiua-article-365243
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2026-07-01T01:00:18Z
publishDate 2026
publisher The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
record_format ojs
resource_txt_mv journaliasakpiua/47/a6ebc150b61188abad4ed35f76adfc47.pdf
spelling journaliasakpiua-article-3652432026-06-30T06:14:59Z Design and evaluation of a quality assurance method for commit messages in version control systems using Word2Vec, FastText, and GloVe embeddings Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe Semonov, Bohdan Pogorilyy, Sergiy AdamW-алгоритм commit message GitHub REST API GRU-шар MLP RNN вихідний текст програми ПЗ повідомлення про внесені зміни репозиторій середнє гармонійне система контролю версій AdamW algorithm commit message GitHub REST API GRU layer MLP RNN source code software change description repository harmonic mean version control system This paper substantiates the relevance of addressing the problem of ensuring the quality of change descriptions in source code files within version control systems. To filter commit messages, machine learning methods are employed, including neural networks of various architectures. The use of neural networks is justified by the need to identify descriptions that accurately reflect the intent of the changes. A comparative analysis of word embedding methods (Word2Vec, FastText, and GloVe) was conducted, along with their application in binary classifiers such as MLP and RNN for filtering code changes. The models were trained on a dataset of change descriptions collected via the GitHub REST API. Model performance was evaluated using Accuracy and F1-score metrics. The effectiveness of the Google Colab environment for prototyping machine learning models was also confirmed. Обґрунтовано актуальність вирішення задачі забезпечення якості описів до внесених змін у вихідних текстах програм для систем контролю версій. Для здійснення фільтрації коментарів використано методи машинного навчання: нейронні мережі різної архітектури. Доцільним є використання нейронних мереж через необхідність пошуку описів до внесених змін, які відображають їхню мету. 
Виконано порівняльний аналіз методів створення векторних представлень слів, таких як Word2Vec, FastText і GloVe, та їх застосування у бінарних класифікаторах MLP і RNN для фільтрації змін. Здійснено навчання моделей на множині описів до внесених змін, отриманих за допомогою спеціального програмного інтерфейсу GitHub REST API. Виконано оцінювання точності моделей за допомогою метрик: точності (Accuracy) та середнього гармонійного (F1-score). Також підтверджено ефективність середовища Google Colab для прототипування моделей машинного навчання. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2026-06-30 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/365243 10.20535/SRIT.2308-8893.2026.2.01 System research and information technologies; No. 2 (2026); 7-19 Системные исследования и информационные технологии; № 2 (2026); 7-19 Системні дослідження та інформаційні технології; № 2 (2026); 7-19 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/365243/350700
spellingShingle AdamW-алгоритм
commit message
GitHub REST API
GRU-шар
MLP
RNN
вихідний текст програми
ПЗ
повідомлення про внесені зміни
репозиторій
середнє гармонійне
система контролю версій
Semonov, Bohdan
Pogorilyy, Sergiy
Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title_alt Design and evaluation of a quality assurance method for commit messages in version control systems using Word2Vec, FastText, and GloVe embeddings
title_full Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title_fullStr Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title_full_unstemmed Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title_short Створення методу забезпечення якості коментарів у системах контролю версій на основі моделей Word2Vec, FastText та GloVe
title_sort створення методу забезпечення якості коментарів у системах контролю версій на основі моделей word2vec, fasttext та glove
topic AdamW-алгоритм
commit message
GitHub REST API
GRU-шар
MLP
RNN
вихідний текст програми
ПЗ
повідомлення про внесені зміни
репозиторій
середнє гармонійне
система контролю версій
topic_facet AdamW-алгоритм
commit message
GitHub REST API
GRU-шар
MLP
RNN
вихідний текст програми
ПЗ
повідомлення про внесені зміни
репозиторій
середнє гармонійне
система контролю версій
AdamW algorithm
commit message
GitHub REST API
GRU layer
MLP
RNN
source code
software
change description
repository
harmonic mean
version control system
url https://journal.iasa.kpi.ua/article/view/365243
work_keys_str_mv AT semonovbohdan designandevaluationofaqualityassurancemethodforcommitmessagesinversioncontrolsystemsusingword2vecfasttextandgloveembeddings
AT pogorilyysergiy designandevaluationofaqualityassurancemethodforcommitmessagesinversioncontrolsystemsusingword2vecfasttextandgloveembeddings
AT semonovbohdan stvorennâmetoduzabezpečennââkostíkomentarívusistemahkontrolûversíjnaosnovímodelejword2vecfasttexttaglove
AT pogorilyysergiy stvorennâmetoduzabezpečennââkostíkomentarívusistemahkontrolûversíjnaosnovímodelejword2vecfasttexttaglove