Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду

This paper explores the application of imitation learning in caregiving robotics, aiming at addressing the increasing demand for automated assistance in caring for the elderly and disabled. While leveraging advancements in deep learning and control algorithms, the study focuses on training neural ne...

Full description

Saved in:
Bibliographic Details
Date:2024
Main Author: Tytarenko, Andrii
Format: Article
Language:English
Published: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2024
Subjects:
Online Access:https://journal.iasa.kpi.ua/article/view/322524
Tags: Add Tag
No Tags, Be the first to tag this record!
Journal Title:System research and information technologies
Download file: Pdf

Institution

System research and information technologies
_version_ 1866391926174908416
author Tytarenko, Andrii
author_facet Tytarenko, Andrii
author_sort Tytarenko, Andrii
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2025-02-09T21:55:38Z
description This paper explores the application of imitation learning in caregiving robotics, aiming at addressing the increasing demand for automated assistance in caring for the elderly and disabled. While leveraging advancements in deep learning and control algorithms, the study focuses on training neural network policies using offline demonstrations. A key challenge addressed is the “Policy Stopping” problem, which is crucial for enhancing safety in imitation learning-based policies, particularly diffusion policies. Novel solutions proposed include ensemble predictors and adaptations of the normalizing flow-based algorithm for early anomaly detection. Comparative evaluations against anomaly detection methods like VAE and Tran-AD demonstrate superior performance on assistive robotics benchmarks. The paper concludes by discussing further research in integrating safety models into policy training, which is crucial for the reliable deployment of neural network policies in caregiving robotics.
doi_str_mv 10.20535/SRIT.2308-8893.2024.4.07
first_indexed 2025-07-17T10:28:40Z
format Article
fulltext  Publisher IASA at the Igor Sikorsky Kyiv Polytechnic Institute, 2024 86 ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 UDC 004.852 DOI: 10.20535/SRIT.2308-8893.2024.4.07 DETECTING UNSAFE BEHAVIOR IN NEURAL NETWORK IMITATION POLICIES FOR CAREGIVING ROBOTICS A. TYTARENKO Abstract. This paper explores the application of imitation learning in caregiving robot- ics, aiming at addressing the increasing demand for automated assistance in caring for the elderly and disabled. While leveraging advancements in deep learning and control algo- rithms, the study focuses on training neural network policies using offline demon- strations. A key challenge addressed is the “Policy Stopping” problem, which is cru- cial for enhancing safety in imitation learning-based policies, particularly diffusion policies. Novel solutions proposed include ensemble predictors and adaptations of the normalizing flow-based algorithm for early anomaly detection. Comparative evaluations against anomaly detection methods like VAE and Tran-AD demonstrate superior performance on assistive robotics benchmarks. The paper concludes by dis- cussing further research in integrating safety models into policy training, which is crucial for the reliable deployment of neural network policies in caregiving robotics. Keywords: assistive robotics, reinforcement learning, diffusion models, imitation learning, anomaly detection. INTRODUCTION In recent years the fields of robotics and AI attracted lots of interest. The ad- vances in deep learning, robotics hardware, deep reinforcement learning, and imi- tation learning made it possible to solve complex control problems by training a neural network policy from mere hundreds of demonstrations. In this paper caregiving robotics is considered. Given the growing numbers of elderly and disabled people who need daily physical care [1; 2], the importance of automation rapidly increases. Caregiving (or assistive) robotics has a promise of addressing this problem, especially in the light of advances in control algorithms and hardware. As in most human-robot interaction scenarios, one of the biggest concerns in caregiving control algorithms is safety. This concern is especially important with neural network-based policies, which lack interpretability and are known to become unstable on out-of-distribution data [3]. For the case of imitation learning, this problem is visualized on Fig. 1. Fig. 1. Out-of-distribution data may lead to failures of a policy Detecting unsafe behavior in neural network imitation policies for caregiving robotics Системні дослідження та інформаційні технології, 2024, № 4 87 There are 4 episodes: A, B, C, D visualized as trajectories from an initial position marked as X to goal region. A and C are present in dataset. B is not present, but since it does not differ much from A and C, the algorithm is able to generalize. The episode D, however, is significantly different, and thus, a policy makes unexpected wrong decisions, failing the task. The progress in the field is nevertheless vast. [4] proposes a method for robotic arm for assistive manipulation tasks. It is a learning-based system, capable of learning from demonstration, based on Dynamic Movement Primitives (DMP) [5]. DMP is a vast framework that includes many instances. Although those methods give a potential for lifelong/incremental learning, they also rely careful modelling and are more difficult to implement and deploy. Paper [6] introduced simulation software for assistive manipulation tasks, named AssistiveGym. It comes with multiple predefined tasks (feeding, drinking, arm manipulation, etc.) and robots (Jaco, PR1, etc.) to pick. For the study, this simulator is chosen for its versatility, simplicity, and speed. The simulator also comes with a Proximal Policy Optimization-based (PPO, [7; 8]) baseline. In this work, an imitation learning-based approach is used for training a neural network policy. Imitation learning [9–11] allows to avoid the necessity of learning from interaction, by instead leveraging the offline data (demonstrations) collected using an existing policy or via teleoperation. The uncertainty estimation problem for Reinforcement Learning algorithms is studied in [12]. Although applied to a different task, the authors show that the uncertainty can be estimated using the log-likelihood and the variance of the model. The problem is, DDPMs in general, and Diffusion Policy specifically, is a generative model, for which calculating a likelihood for the generated plan is difficult [13], making the proposed approach hardly applicable for the considered problem. Other methods include [14–17]. In the following sections the “Policy Stopping problem” is studied and solutions are proposed. These solutions are compared to the application of out-of- box anomaly detection and uncertainty detection methods, proved to be successful in other domains. A system with a safety model and an imitation policy is developed and demonstrated. Lastly, the paper concludes with the discussion of the results and further research. PRELIMINARIES Markov Decision Process (MDP) is a collection ),,,( TrAS with S — state space, A — action space, ),( asr – reward function and ),|( 1 ttt assPT  — dy- namics. In this paper the reward is not assumed to be defined for full trajectories, classifying them as either “success” or “failure”. Reinforcement Learning (RL) algorithms optimize a policy  , which max- imizes the expected total reward of the MDP:      0 )(~ * ),(maxarg t ttp asr , where  is a trajectory ),...,,,( 1100 Tsasas sampled by applying a policy  . In offline setting (offline RL) an access to environment for collecting more interactions is assumed to be absent, and the whole training is conducted using only pre-collected demonstrations. Diffusion Policy is essentially a Denoising Diffusion Probabilistic Model (DDPM) which models a distribution )|( OAp , where O is a subset of prior ob- A. Tytarenko ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 88 servations, and A is a limited sequence of further actions, i.e. a short-horizon plan. Normalizing flow-based methods [18] estimate the data likelihood explicitly, by using a reversible block of various kinds. A trained network maps the input data x to latent space Z , such that the inverse mapping ))((1 xff  is trivially computable. METHOD Data collection. In this work imitation learning techniques are used to train a neural network policy. Imitation learning methods as a rule require pre-recorded trajectories, e.g. a dataset with sequences of a form: },...,1,),...,,,{( 1100 NisasasD iT  . Here N — is the number of trajectories and T — is a length of a trajectory. For collecting the trajectories, two methods are used — teleoperation and online reinforcement learning algorithms. Teleoperation is a fairly difficult task when it comes to robotic arm manipulation problems, especially in simulation. A keyboard-based teleoperation feature from the original AssistiveGym implementation is adapted for the task. The modified version is available via GitHub [19]. Online reinforcement learning algorithms allow training a policy neural network by interacting with an environment. They are usually way less sample- efficient, i.e. it takes much more data and training steps to learn a useful behaviour. Nevertheless, it is convenient in case of AssistiveGym, since some tasks are very difficult to teleoperate. Proximal Policy Optimization [7] algorithm is used, which is a well-established baseline Reinforcement Learning method, to collect useful trajectories for some of the tasks. Diffusion Policy for Assistive Robotics. Recent advances brought much more efficient imitation learning methods, such as Diffusion Policy [10] and Ac- tion Chunk Transformer [20]. Diffusion Policy, for instance, allows to train a relatively small neural network policy from up to 200–300 demonstrations in some cases [10]. Diffusion Policy fits a network capable of producing a plan of actions A from )|( SAP without explicitly learning it. More precisely, ),,...,( kTk ssS O ),,...,,...,( AO TkkTk aaaA  where k is a current time step, AT — action plan horizon, and OT — state (ob- servation) horizon. In this work, S is a concatenation of previous states, each of which is represented as a vector of real numbers, i.e. SN ts  . In the current study SN is a relatively small number (<100), although the method allows work- ing with larger-dimensional state spaces. This description also applies to the ac- tion plan A : aN ta  . The problem, however, is that it is difficult to compute a likelihood of a sample given a model only, which means that there is only a short-horizon plan A without any additional information. Detecting unsafe behavior in neural network imitation policies for caregiving robotics Системні дослідження та інформаційні технології, 2024, № 4 89 Although the method is known to be sample-efficient, it still highly depends on the quality of the dataset, i.e. state-space coverage, trajectory optimality, etc. See Fig. 1. Therefore, there are almost none guarantees that a deployed robotic policy won’t fail in unexpected ways, potentially damaging the hardware. Moreover, since the Caregiving Robotics deals with human-robot interaction, this may make the robot dangerous to a human, which is a critical in this domain. Policy stopping problem. In this study approaches to the “Policy Stopping” problem are proposed and compared. In it, an algorithm must decide whether a policy execution must be stopped immediately. This problem can be also viewed as an early anomaly detection problem. However, there is one important differ- ence. The stopping algorithm must be trained on offline data, generated by a be- havioural policy (a human demonstrator, a scripted policy, arbitrary neural net- work policy, or a mix), but tested on a data, generated by a different policy trained on that data (e.g. imitation learning algorithm). The key difference from traditional unsupervised anomaly detection is that an algorithm is conditioned on a dataset, generated by a distribution different from the test one. Therefore, such algorithm must balance the similarity of test trajectory and train trajectories, distinguishing between a good plan executed suc- cessfully but in unusual way and a bad plan that ends up in failure. State-prediction approach. The first approach considered is inspired by MBPO [14] and widely used in Reinforcement Learning algorithms for different purposes [21–23]. This approach uses a “disagreement” of an ensemble of next state prediction neural networks. The idea is that the next state prediction will be accurate and won’t vary much between networks in the ensemble if the input is in-distribution (familiar to the model). At the same time, a state-action pair may not be known. The reason may be that it was not present in a dataset or that a da- taset does not contain enough data for a predictor to generalize successfully to this state-action pair. Then, the next state predictors will “disagree”, which can be measured as a variance of some kind. Based on that principle, a network is trained, approximating a function ,)|( SASf  which predicts a vector of outT future states. For training, inputs and outputs are sampled from a collection of trajectories and a neural network is fit in a simple supervised way, minimizing the MSE (Mean Squared Error) objective: 2 2||),(||),,,( SASfSASLMSE   . Sampling is executed in a following way: ,~),...,( 0 demonT Dss },,...,{~ kTkk  ),,...,( kTk ssS in ),,...,( kTk aaA in ),,...,( 1 outTkk ssS  An ensemble of K models is trained, by initializing and fitting them inde- pendently on the same data. For estimating the level of uncertainty, a standard deviation between state predictions is computed by the following formula: A. Tytarenko ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 90 ,),( 1 ),( 1 1 ),( ..1 2 ..1                 Ki Kj lll ASf K ASf K ASU ji    SNSl l ESP ASUASU ||..1 ),,(),( where i ( Ki ..1 ) — parameters of neural networks in an ensemble. After computing the uncertainty level ESPU , an algorithm compares it to a manually tuned threshold and returns a decision for whether to stop an episode or not. See the Pseudocode 1 for details. Pseudocode 1. Training an ensemble state prediction model. 1. Input: Dataset demonD . 2. Initialize hyperparameters KTT outin ,, . 3. For EN epochs: 4. For Kj ..1 : 5. Sample SAS ,, . 6. Compute the MSE loss ),,,( SASLMSE . 7. Compute the gradients w.r.t. j , update the weights j . 8. End for. 9. End for. 10. Return: K ...1 . The considered approach follows [14] with a difference that the input to the state prediction function is not necessarily a single state-action pair, but a chunk, or an entire sub-trajectory. Although excessive due to the assumed Markovianess of the MDP, this allows to incorporate correlations between earlier states and de- cisions made by an agent, such the resulting neural network ensemble shall dis- agree when there are longer-term non-immediate anomalies in entire sub- trajectories and not only a single state-action pair. In other words, single state-action version computes ),( asU ESP , while the proposed one computes ),( ASU ESP . In this study a simple MLP (multi-layered perceptron) architecture is used for a single state-action version, and a CNN (convolutional neural network) is used for the proposed sub-trajectory version. Adapting anomaly detection methods based on normalizing flows. A promising approach in unsupervised anomaly detection is normalizing flows. In this paper, a method named MVT Flow [18] is considered. MVT Flow is designed for unsupervised anomaly detection in time series in a robotics domain. Using a convolutional neural network as a backbone, it is trained to estimate the likelihood of normal data. The anomaly score is then computed as a loss function of a test data w.r.t. the trained model. MVT Flow can’t be successfully applied to the presented problem out of box. Although [18] provides a method for credit assignment of elements of the series, it still requires a network to process the entire time series first. Thus, to adapt MVT Flow to early anomaly detection setting the following modification is proposed. Detecting unsafe behavior in neural network imitation policies for caregiving robotics Системні дослідження та інформаційні технології, 2024, № 4 91 Masking augmentation and sample weighting. The anomaly detection method MVT-Flow is (i) unsupervised and (ii) assigns an anomaly score to the entire input sequence. Therefore, applying it to a not finished sequence may be problematic. The neural net directly maximizes the likelihood of training data, so a previously unseen sequence will get a low likelihood score and will be considered anomaly. First, unfinished sequences are added to the training data, by randomly choosing a sub-episode length and removing all following elements from the episode. The problem, however, is that the actual abnormal trajectory may start as a normal one with only minor differences. Resulting model does not distinguish between a beginning of a normal trajectory and a fully normal trajectory, where clearly the likelihood should be different. So, second, a sample weighting is introduced to compensate for that effect: ,,max 0 minmax min            w KK KK w where maxmin ,, KKK are respectively a sub-episode length, a minimum sub- episode length and a full episode length. Intuitively, the ratio under the square root is a value which is 0 when the sub-episode is minimal and 1 when the sub-episode is full. The square root is applied to smooth the weights, making the difference between the full episode and minimal one smaller. Full algorithm. Pseudocode 2. Training an early-detection MVT-Flow model. 1. Input: Dataset demonD . 2. Initialize hyperparameters .,,,, 0maxmin wKKNE 3. For EN epochs: 4. Sample demonrr DAS ~, . 5. Sample random sub-episode length }.,...,{~ maxmin KKK 6. Compute masked data AS , : ||...1, SiIS Ki i   , ||...1, AiIA Ki i   7. Compute the MVT-Flow loss ),,( ASLMVT . 8. Compute the sample weight:            0 minmax min ,max w KK KK w . 9. Update weights: ),,(:   ASLw MVT . 10. End for. 11. End for. 12. Return: weights  . EXPERIMENTAL VALIDATION In this section the results of the study on several benchmarks of Caregiving Ro- botics are provided. All benchmarks are conducted using environments from the modified version of the AssistiveGym suit, available via GitHub [19]. A simulated Jaco robotic arm is used, the following assistive tasks are considered: Assistive Feeding (250 teleoperation deomnstrations), Assistive Bed Bathing (1000, PPO), Arm Manipulation (1000, PPO), and Scratch Itch (1000, PPO). A. Tytarenko ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 92 First, a policy network is trained using a diffusion policy algorithm on each collected dataset with trajectories. Next, on each dataset, the models of weighted-masked (WM) MVT-Flow, original MVT-Flow, ensemble state predictors (single state-action and sub- trajectory based), Variational Autoencoder (VAE) and Tran-AD. Weighted-masked MVT-Flow is trained with 1min K , 200max K , ,1.00 w 4108  , .85EN Other hyperparameters are kept in sync with [18]. Ensemble state predictors with 5K . For single state-action version take 1 outin TT , and for sub-trajectory based, take tTin  , 1outT . Here t means that all observations and actions observed up to a moment t are considered. Single state-action predictor is applied sequentially, and a maximum uncertainty score is taken as a resulting anomaly score. Variational Autoencoder has a small CNN backbone and KL penalty is set to 1. The anomaly score is set to the value of reconstruction loss of the input sub-episode. For Tran-AD the window size is set to 20. For evaluation, a Tran-AD network is inferred on all windows contained within the sub-episode and the resulting anomaly score is set to maximum anomaly score of every window. Every other hyperparameter remains unchanged from the original papers. To evaluate the quality of the proposed models, two kinds of metrics are reported: AUROC and FPR@TPR95. The former one is defined as an area under the Receiver Operating Characteristic curve. The later one is defined as the False Positive Rate on a threshold corresponding to 0.95 True Positive Rate. Both are common metrics in anomaly detection literature [24]. However, since the goal is to evaluate the early anomaly detection property, the metrics are reported for partial trajectories of various maximum lengths, namely 10%, 20%, 30%, 50%, 75%, and 100% of the maximum episode length. Better metrics on smaller percentages correspond to better earlier detection ability of an evaluated method. Tables 1–5 contain metrics reported when evaluated of each assistive environment datasets. Note, that for AUROC larger is better, while for FPR@TPR95 lower is better. T a b l e 1 . Evaluation on Assistive Feeding Method Metric 10% 20% 30% 50% 75% 100% FPR@TPR95 0.81 0.73 0.77 0.76 0.58 0.34 Single SP AUROC 0.79 0.80 0.79 0.81 0.90 0.92 FPR@TPR95 0.70 0.76 0.73 0.45 0.18 0.001 VAE AUROC 0.70 0.71 0.77 0.86 0.96 1.00 FPR@TPR95 0.72 0.37 0.55 0.23 0.06 0.02 MVT-Flow AUROC 0.70 0.80 0.80 0.94 0.98 0.99 FPR@TPR95 0.74 0.74 0.73 0.63 0.64 0.40 Tran-AD AUROC 0.65 0.70 0.69 0.79 0.89 0.92 FPR@TPR95 0.51 0.63 0.60 0.33 0.07 0.001 Sub-trajectory SP* AUROC 0.79 0.83 0.83 0.92 0.97 1.00 FPR@TPR95 0.64 0.61 0.50 0.21 0.06 0.001 WM MVT-Flow* AUROC 0.77 0.83 0.84 0.94 0.98 1.00 Assistive feeding is a simpler task, so most normal trajectories have a rela- tively short length. Therefore, it is expected that a good method gets maximum score on 100% of the environment length. Detecting unsafe behavior in neural network imitation policies for caregiving robotics Системні дослідження та інформаційні технології, 2024, № 4 93 T a b l e 2 . Evaluation on Arm Manipulation Method Metric 10% 20% 30% 50% 75% 100% FPR@TPR95 0.82 0.80 0.77 0.82 0.40 0.20 Single SP AUROC 0.73 0.72 0.79 0.78 0.89 0.95 FPR@TPR95 0.82 0.80 0.73 0.37 0.26 0.02 VAE AUROC 0.82 0.80 0.77 0.87 0.92 0.99 FPR@TPR95 0.84 0.76 0.43 0.16 0.08 0.01 MVT-Flow AUROC 0.72 0.75 0.88 0.95 0.97 0.99 FPR@TPR95 0.97 0.96 0.97 0.90 0.83 0.78 Tran-AD AUROC 0.43 0.43 0.45 0.51 0.57 0.65 FPR@TPR95 0.84 0.80 0.85 0.41 0.10 0.02 Sub-trajectory SP* AUROC 0.72 0.74 0.68 0.88 0.96 0.99 FPR@TPR95 0.72 0.71 0.67 0.16 0.07 0.03 WM MVT-Flow* AUROC 0.88 0.88 0.89 0.96 0.98 0.99 T a b l e 3 . Evaluation Assistive Bed Bathing Method Metric 10% 20% 30% 50% 75% 100% FPR@TPR95 0.88 0.91 0.94 0.95 0.90 0.87 Single SP AUROC 0.79 0.65 0.66 0.64 0.66 0.67 FPR@TPR95 0.84 0.79 0.79 0.76 0.54 0.85 VAE AUROC 0.68 0.63 0.63 0.70 0.81 0.97 FPR@TPR95 1.00 0.83 0.83 0.55 0.44 0.28 MVT-Flow AUROC 0.40 0.66 0.71 0.77 0.82 0.82 FPR@TPR95 1.00 0.88 1.00 0.94 0.89 0.89 Tran-AD AUROC 0.50 0.53 0.51 0.51 0.52 0.54 FPR@TPR95 0.88 0.87 0.80 0.82 0.72 0.001 Sub-trajectory SP* AUROC 0.77 0.74 0.74 0.66 0.71 1.00 FPR@TPR95 0.87 0.69 0.67 0.50 0.40 0.22 WM MVT-Flow* AUROC 0.72 0.77 0.77 0.81 0.86 0.94 Bed bathing dataset is challenging due to the low success rate of the demonstration policy. Therefore, the distribution of input trajectories may not cover most scenarios, limiting an imitation learning policy’s performance. T a b l e 4 . Scratch Itch Method Metric 10% 20% 30% 50% 75% 100% FPR@TPR95 0.84 0.89 0.79 0.73 0.70 0.56 Single SP AUROC 0.60 0.63 0.66 0.69 0.80 0.82 FPR@TPR95 0.95 0.89 0.84 0.77 0.29 0.17 VAE AUROC 0.60 0.61 0.66 0.75 0.88 0.92 FPR@TPR95 0.85 0.89 0.84 0.67 0.45 0.30 MVT-Flow AUROC 0.56 0.65 0.70 0.75 0.84 0.90 FPR@TPR95 0.72 0.60 0.70 0.81 0.67 0.55 Tran-AD AUROC 0.77 0.79 0.75 0.71 0.79 0.83 FPR@TPR95 0.88 0.84 0.82 0.68 0.38 0.07 Sub-trajectory SP* AUROC 0.56 0.60 0.77 0.80 0.87 0.93 FPR@TPR95 0.81 0.83 0.84 0.65 0.41 0.30 WM MVT-Flow* AUROC 0.74 0.78 0.79 0.79 0.86 0.91 A. Tytarenko ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 94 From Tables 1–4, on can make the following observations. First, Weighted-Masked MVT-Flow consistently outperforms raw MVT- Flow. For higher % of maximum length the raw version usually performs on par with the proposed modification, which is expected. This is because anomalous episodes take 100% of maximum length time, while most normal episodes are up to 50–75% of time. Second, Sub-trajectory SP performs on par with WM MVT-Flow on simpler environments, such as Feeding. It also outperforms single step predictors, espe- cially on larger time periods. Tran-AD models perform the worst on most datasets due to its windowed inputs. The only exception is Scratch Itch (Table 4). It is hypothesized that the reason for this is the smaller-scale nature of anomalies in the test trajectories. Now, a demonstration of a system with a diffusion policy deployed with a safety model is provided (see Fig. 2). In practice, a set of thresholds for each time period is selected, since the anomaly score for applied methods is non-decreasing. The lower part of the diagram shows a plot of the anomaly score (normal- ized), and arrows matching the upper images with corresponding time steps. Most of the time, the score is low, since the arm performs usual moves. The end of the plot shows a spike in anomaly score, resulting in the system halt. The anomaly is that the arm drops food and spins itself in unusual way. In the remaining of this episode, the arm would twist itself dangerously, potentially damaging hardware. CONCLUSION In this paper a challenging “Policy Stopping problem” is introduced and studied. This problem is important for improving safety of imitation learning-based neural network policies, specifically diffusion policies. The solutions specific to the introduced problem are proposed: ensemble of sub-trajectory-based state predictors and a modification of a recent MVT-Flow algorithm for early anomaly detection. The algorithms are evaluated and compared against ablated original unmodi- fied versions and known anomaly detection approaches, such as VAE and Tran- Fig. 2. Demonstration of the proposed approach on the Assistive Feeding environment Detecting unsafe behavior in neural network imitation policies for caregiving robotics Системні дослідження та інформаційні технології, 2024, № 4 95 AD. The proposed solutions are shown to be more suitable for the introduced problem and tend to outperform other methods on assistive robotics benchmarks. For the evaluation of early-detection capabilities the usual metrics have been adapted. Lastly, a system with a safety model and an imitation policy is developed and demonstrated. The interesting future work directions include integration of the proposed safety models to training of imitation policies (e.g. [21]), safe data collection for model finetuning, and adaptation of safety models to vision-based tasks. This may bring the safe and robust deployment of neural network policies, so important for caregiving robotics domain. REFERENCES 1. J. Broekens, M. Heerink, and H. Rosendal, “Assistive social robots in elderly care: A re- view,” Gerontechnology, vol. 8, no. 2, pp. 94–103, 2009. doi: https://doi.org/10.4017/ gt.2009.08.02.002.00 2. D.M. Taylor, “Americans with disabilities: 2014,” US Census Bureau, pp. 1–32, 2018. 3. Dan Hendrycks et al., “The many faces of robustness: A critical analysis of out-of- distribution generalization,” Proceedings of the IEEE/CVF international conference on computer vision, 2021. doi: 10.1109/ICCV48922.2021.00823 4. Clemente Lauretti et al., “Learning by demonstration for planning activities of daily living in rehabilitation and assistive robotics,” IEEE Robotics and Automation Let- ters, vol. 2, issue 3, pp. 1375–1382, 2017. doi: 10.1109/LRA.2017.2669369 5. Matteo Saveriano, Fares J. Abu-Dakka, Aljaz Kramberger, and Luka Peternel, “Dy- namic movement primitives in robotics: A tutorial survey,” The International Jour- nal of Robotics Research, vol. 42, issue 13, pp. 1133–1184, 2023. 6. Z. Erickson, V. Gangaram, A. Kapusta, C.K. Liu, and C.C. Kemp, “Assistive gym: A physics simulation framework for assistive robotics,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 10169–10176. 7. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint, 2017. doi: https://doi.org/10.48550/arXiv. 1707.06347 8. Jakhotiya Yash, Iman Haque, “Improving Assistive Robotics with Deep Reinforce- ment Learning,” arXiv preprint, 2022. doi: https://doi.org/10.48550/arXiv.2209.02160 9. Maryam Zare, Parham M. Kebria, Abbas Khosravi, and Saeid Nahavandi, “A survey of imitation learning: Algorithms, recent developments, and challenges,” arXiv pre- print, 2023. doi: https://doi.org/10.48550/arXiv.2309.02473 10. Chi Cheng et al., “Diffusion policy: Visuomotor policy learning via action diffu- sion,” arXiv preprint, 2023. doi: https://doi.org/10.48550/arXiv.2303.04137 11. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” arXiv pre- print, 2020. doi: https://doi.org/10.48550/arXiv.2006.11239 12. Vincent Mai, Mani Kaustubh, and Paull Liam, “Sample efficient deep reinforcement learning via uncertainty estimation,” arXiv preprint, 2022. doi: https://doi.org/10.48550/ arXiv.2201.01666 13. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differ- ential equations,” arXiv preprint, 2020. doi: https://doi.org/10.48550/arXiv.2011.13456 14. Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine, “When to trust your model: Model-based policy optimization,” Advances in Neural Information Process- ing Systems 32, 2019. doi: 10.48550/arXiv.1906.08253 15. Shunan Guo, Zhuochen Jin, Qing Chen, David Gotz, Hongyuan Zha, and Nan Cao, “Visual anomaly detection in event sequence data,” 2019 IEEE International Con- ference on Big Data (Big Data). doi: 10.1109/BigData47090.2019.9005687 16. Diederik P. Kingma, Max Welling, “Auto-encoding variational bayes,” arXiv pre- print, 2013. doi: https://doi.org/10.48550/arXiv.1312.6114 A. Tytarenko ISSN 1681–6048 System Research & Information Technologies, 2024, № 4 96 17. Shreshth Tuli, Giuliano Casale, and Nicholas R. Jennings, “TranAD: Deep trans- former networks for anomaly detection in multivariate time series data,” arXiv pre- print, 2022. doi: https://doi.org/10.48550/arXiv.2201.07284 18. Jan Thieß Brockmann, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt, “The voraus-AD Dataset for Anomaly Detection in Robot Applications,” IEEE Transac- tions on Robotics, 2023. doi: 10.1109/TRO.2023.3332224 19. A. Tytarenko, Assistive Gym Fork. 2024. Accessed on June 19, 2024. [Online]. Available: https://github.com/titardrew/assistive-gym 20. Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn, “Learning fine- grained bimanual manipulation with low-cost hardware,” arXiv preprint, 2023. doi: https://doi.org/10.48550/arXiv.2304.13705 21. Tianhe Yu et al., “Mopo: Model-based offline policy optimization,” Advances in Neural Information Processing Systems 33, pp. 14129–14142, 2020. Available: https:// proceed- ings.nips.cc/paper/2020/file/a322852ce0df73e204b7e67cbbef0d0a-Paper.pdf 22. Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims, "MOReL: Model-based offline reinforcement learning,” Advances in Neural Information Processing Systems 33, pp. 21810–21823, 2020. Available: https://proceedings. neu- rips.cc/paper_files/paper/2020/file/f7efa4f864ae9b88d43527f4b14f750f-Paper.pdf 23. Laura Smith, Yunhao Cao, and Sergey Levine, “Grow your limits: Continuous Im- provement with Real-World RL for Robotic Locomotion,” arXiv preprint, 2023. doi: https://doi.org/10.48550/arXiv.2310.17634 24. Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li, “Energy-based out-of- distribution detection,” Advances in Neural Information Processing Systems 33, pp. 21464–21475, 2020. Available: https://proceedings.neurips.cc/paper/2020/file/ f5496252609c43eb8a3d147ab9b9c006-Paper.pdf Received 11.07.2024 INFORMATION ON THE ARTICLE Andrii M. Tytarenko, ORCID: 0000-0002-8265-642X, Educational and Research Insti- tute for Applied System Analysis of the National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: titarenkoan@gmail.com ВИЯВЛЕННЯ НЕБЕЗПЕЧНОЇ ПОВЕДІНКИ В ПОЛІТИКАХ ІМІТАЦІЇ НЕЙРОМЕРЕЖІ ДЛЯ РОБОТОТЕХНІКИ ДЛЯ ДОГЛЯДУ / А.М. Титаренко Анотація. Досліджено застосування навчання за імітацією в задачах робототе- хніки для догляду, спрямоване на вирішення зростаючого попиту на автомати- зовану допомогу в обслуговуванні літніх людей і людей з інвалідністю. На підставі досягнень у глибокому навчанні та керуванні дослідження зосередже- но на навчанні стратегій, представлених нейронними мережами за допомогою попередньо зібраних демонстрацій. Однією з ключових проблем, яку вирішу- ється, є проблема «зупинки стратегії», що є важливою для підвищення безпеки в стратегіях, заснованих на навчанні імітацією, таких як дифузійні стратегії. Пропонуються рішення проблеми на базі ансамблів прогнозів стану та адапта- ції алгоритму на основі нормалізаційного потоку для виявлення аномалій на ранніх стадіях виконання. Порівняльний аналіз з методами виявлення анома- лій, такими як VAE та Tran-AD, демонструє перевагу в ефективності методів у задачах робототехніки для догляду. Запропоновано подальші напрями дослі- джень з інтеграції моделей безпеки в навчання нейромережевих стратегій, що є важливим для надійного впровадження нейромережевих рішень у робототе- хніку для догляду за людьми. Ключові слова: допоміжна робототехніка, навчання з підкріпленням, дифу- зійні моделі, навчання імітацією, виявлення аномалій.
id journaliasakpiua-article-322524
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2025-07-17T10:28:40Z
publishDate 2024
publisher The National Technical University of Ukraine &quot;Igor Sikorsky Kyiv Polytechnic Institute&quot;
record_format ojs
resource_txt_mv journaliasakpiua/da/9b134d2744c90c80a5ab9816d9908eda.pdf
spelling journaliasakpiua-article-3225242025-02-09T21:55:38Z Detecting unsafe behavior in neural network imitation policies for caregiving robotics Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду Tytarenko, Andrii assistive robotics reinforcement learning diffusion models imitation learning anomaly detection допоміжна робототехніка навчання з підкріпленням дифузійні моделі навчання імітацією виявлення аномалій This paper explores the application of imitation learning in caregiving robotics, aiming at addressing the increasing demand for automated assistance in caring for the elderly and disabled. While leveraging advancements in deep learning and control algorithms, the study focuses on training neural network policies using offline demonstrations. A key challenge addressed is the “Policy Stopping” problem, which is crucial for enhancing safety in imitation learning-based policies, particularly diffusion policies. Novel solutions proposed include ensemble predictors and adaptations of the normalizing flow-based algorithm for early anomaly detection. Comparative evaluations against anomaly detection methods like VAE and Tran-AD demonstrate superior performance on assistive robotics benchmarks. The paper concludes by discussing further research in integrating safety models into policy training, which is crucial for the reliable deployment of neural network policies in caregiving robotics. Досліджено застосування навчання за імітацією в задачах робототехніки для догляду, спрямоване на вирішення зростаючого попиту на автоматизовану допомогу в обслуговуванні літніх людей і людей з інвалідністю. На підставі досягнень у глибокому навчанні та керуванні дослідження зосереджено на навчанні стратегій, представлених нейронними мережами за допомогою попередньо зібраних демонстрацій. Однією з ключових проблем, яку вирішується, є проблема «зупинки стратегії», що є важливою для підвищення безпеки в стратегіях, заснованих на навчанні імітацією, таких як дифузійні стратегії. Пропонуються рішення проблеми на базі ансамблів прогнозів стану та адаптації алгоритму на основі нормалізаційного потоку для виявлення аномалій на ранніх стадіях виконання. Порівняльний аналіз з методами виявлення аномалій, такими як VAE та Tran-AD, демонструє перевагу в ефективності методів у задачах робототехніки для догляду. Запропоновано подальші напрями досліджень з інтеграції моделей безпеки в навчання нейромережевих стратегій, що є важливим для надійного впровадження нейромережевих рішень у робототехніку для догляду за людьми. The National Technical University of Ukraine &quot;Igor Sikorsky Kyiv Polytechnic Institute&quot; 2024-12-25 Article Article Peer-reviewed Article application/pdf https://journal.iasa.kpi.ua/article/view/322524 10.20535/SRIT.2308-8893.2024.4.07 System research and information technologies; No. 4 (2024); 86-96 Системные исследования и информационные технологии; № 4 (2024); 86-96 Системні дослідження та інформаційні технології; № 4 (2024); 86-96 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/322524/312904
spellingShingle допоміжна робототехніка
навчання з підкріпленням
дифузійні моделі
навчання імітацією
виявлення аномалій
Tytarenko, Andrii
Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title_alt Detecting unsafe behavior in neural network imitation policies for caregiving robotics
title_full Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title_fullStr Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title_full_unstemmed Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title_short Виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
title_sort виявлення небезпечної поведінки в політиках імітації нейромережі для робототехніки для догляду
topic допоміжна робототехніка
навчання з підкріпленням
дифузійні моделі
навчання імітацією
виявлення аномалій
topic_facet assistive robotics
reinforcement learning
diffusion models
imitation learning
anomaly detection
допоміжна робототехніка
навчання з підкріпленням
дифузійні моделі
навчання імітацією
виявлення аномалій
url https://journal.iasa.kpi.ua/article/view/322524
work_keys_str_mv AT tytarenkoandrii detectingunsafebehaviorinneuralnetworkimitationpoliciesforcaregivingrobotics
AT tytarenkoandrii viâvlennânebezpečnoípovedínkivpolítikahímítacíínejromerežídlârobototehníkidlâdoglâdu