Огляд дедуплікації зображень для хмарного зберігання

Increased growth of real-life communication has motivated the creation, transmission, and digital storage of vast volumes of images and video data on the cloud. The explosive increase in virtual/visual image data on cloud servers requires efficient storage utilization that can be addressed using ima...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Datum:2023
Hauptverfasser: Chaudhari, Shilpa, Aparna, Ramalingappa
Format: Artikel
Sprache:Englisch
Veröffentlicht: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023
Schlagworte:
Online Zugang:https://journal.iasa.kpi.ua/article/view/273996
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Назва журналу:System research and information technologies
Завантажити файл: Pdf

Institution

System research and information technologies
_version_ 1867334432630767617
author Chaudhari, Shilpa
Aparna, Ramalingappa
author_facet Chaudhari, Shilpa
Aparna, Ramalingappa
author_institution_txt_mv [ { "author": "Shilpa Chaudhari", "institution": "Ramaiah Institute of Technology, Bangalore" }, { "author": "Ramalingappa Aparna", "institution": "Ramaiah Institute of Technology, Bangalore" } ]
author_sort Chaudhari, Shilpa
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2024-02-01T21:03:07Z
description Increased growth of real-life communication has motivated the creation, transmission, and digital storage of vast volumes of images and video data on the cloud. The explosive increase in virtual/visual image data on cloud servers requires efficient storage utilization that can be addressed using image deduplication technology. Even though the virtual and visual image properties are different, the existing literature uses a similar approach for deduplication checks, which motivated us to consider both image types for this review. This article aims to provide a detailed survey of state-of-the-art visuals as well as virtual image deduplication techniques in a cloud environment, summarizing and organizing them by developing a five-dimensional taxonomy for analysing the features and performance with several non-overlapping categories in each dimension. These include: 1) location of applying deduplication; 2) image feature extraction; 3) time of application; 4) image data partitioning strategy; 5) involvement of user dataset level. Existing image deduplication techniques are categorized into two main categories based on whether the technique involves security. A comparison of techniques is discussed across a set of functional and performance parameters. The current issues are highlighted with the possible future directions to motivate further research studies on the topic.
doi_str_mv 10.20535/SRIT.2308-8893.2023.4.09
first_indexed 2025-07-17T10:28:05Z
format Article
fulltext  S. Chaudhari, R. Aparna, 2023 Системні дослідження та інформаційні технології, 2023, № 4 113 TIДC ЕВРИСТИЧНІ МЕТОДИ ТА АЛГОРИТМИ В СИСТЕМНОМУ АНАЛІЗІ ТА УПРАВЛІННІ UDC 62-50 DOI: 10.20535/SRIT.2308-8893.2023.4.09 SURVEY OF IMAGE DEDUPLICATION FOR CLOUD STORAGE S. CHAUDHARI, R. APARNA Abstract. Increased growth of real-life communication has motivated the creation, transmission, and digital storage of vast volumes of images and video data on the cloud. The explosive increase in virtual/visual image data on cloud servers requires efficient storage utilization that can be addressed using image deduplication tech- nology. Even though the virtual and visual image properties are different, the exist- ing literature uses a similar approach for deduplication checks, which motivated us to consider both image types for this review. This article aims to provide a detailed survey of state-of-the-art visuals as well as virtual image deduplication techniques in a cloud environment, summarizing and organizing them by developing a five- dimensional taxonomy for analysing the features and performance with several non- overlapping categories in each dimension. These include: 1) location of applying deduplication; 2) image feature extraction; 3) time of application; 4) image data par- titioning strategy; 5) involvement of user dataset level. Existing image deduplication techniques are categorized into two main categories based on whether the technique involves security. A comparison of techniques is discussed across a set of functional and performance parameters. The current issues are highlighted with the possible fu- ture directions to motivate further research studies on the topic. Keywords: image deduplication, cloud computing, cloud storage, image copy detection. INTRODUCTION With the massive development of electronics and the internet, digital data is in- creasing at an alarming rate. This includes data in the form of text, images, vid- eos, sketches, etc. All this data comes from different parts of the Internet and hence causes information explosion due to huge velocity of data generation and huge variety of data sources. In 2007, it is said that the total digital resources of the world exceeded the global storage capacity for the very first time. Hence, it was decided that this problem of information explosion cannot be handled by simply increasing the amount of storage. But now it is estimated that by 2025, there will be 163.2 zettabytes of digital data [35]. Primary data generators like social networking platforms, industries and transactional data from various businesses are generating huge volumes of data every day. Due to the sudden increase in volumes of data, it becomes extremely crucial to be able to store this data in a cost-effective manner that optimizes stor- age. Cloud based infrastructure for on demand service provisioning from any- where, anytime is the popular solution used. The National Institute of Standards and Technology (NIST) reference architecture for cloud computing has following S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 114 five actors: 1) cloud consumer; 2) cloud provider; 3) cloud carrier; 4) cloud audi- tor; 5) cloud broker. The interaction among these actors is shown in Fig. 1 along with their activities and functions. Storage as a service on Cloud is one of the critical and popular services wherein the cloud storage provider provides cost effective and easy to access storage space on the cloud to the interested customers to host their data instead of maintaining it on their on-premises. The user data stored on the cloud can be in any form like images, audio, video etc. The customers or the data owners cannot rely on public service providers like cloud for data security. So, to provide secu- rity to the data, many researchers proposed secure storage techniques for storing the data on cloud. The cloud services are provided through virtual images whose size is very large requiring large amount of storage space in addition to huge network trans- mission requirements and reduction in operation time. Virtual images, each with large size starting with 1GB with different configurations may belong to a single cloud user. Almost 80% of the virtual image content is identical among these Vir- tual Machine (VM) images due to existence of similar data segments [1]. The two primary reasons namely sudden explosion of data and similarity in virtual images apparently induce a need in Cloud Service Providers (CSP) to optimize the stor- age and network bandwidth used in data transfer of VM images. Hence, a concept called deduplication was formulated, which could identify duplicates and delete all copies except one (or precisely retain as many copies as specified by deduplication ratio) [36]. It optimally minimizes the storage utiliza- tion by deleting redundant data from the cloud storage or data centers and thereby bringing down the unnecessary usage of network bandwidth [40]. It is a lossless data compression technology, which replaces duplicate image copies by using pointers to the unique image object. The image deduplication [40] process involves four steps to remove dupli- cate images or parts of images as follows [39]: 1) file chunking wherein the image is divided into fixed or variable size blocks known as chunks; 2) fingerprint gen- eration: the fingerprint will be computed using some transformation algo- rithm/technique such as hash function; 3) fingerprint lookup: the fingerprint of Cloud Consumer – uses service from Cloud Providers Cloud Auditor – conduct security audit, privacy impact audit and, performance audit of the cloud implementation Cloud Provider – makes the service available to the interested parties: • through various layers, resource abstraction and control layer, physical resource layer; • cloud service management including business support, configuration, service provisioning portability, and interoperability; • security and privacy Cloud Broker manages – the use, performance, and delivery of cloud services, and negotiates relationships between Cloud Providers and Cloud Consumers Direct request Request via Cloud Carrier – provides connectivity and transport of cloud services from Cloud Providers to Cloud Consumers Fig. 1. Cloud Actors Activities and Interaction in Cloud Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 115 already existing images will be compared with this newly created fingerprint in step 2 for identifying the duplicate file/block. If it is found to be same, this new file/block will be discarded otherwise it will be stored in the cloud storage; 4) data storage: store the unique image/block systematically on the cloud storage. Deduplication can be done at two levels: the single user level and the cross level. In the single user level, the deduplication is done keeping only one user in mind and duplication happens in their own storage. Cross level deduplication is when data is compared by taking from many users, and then redundant data is deleted. It can also be done at either the client side or the server side. At the client side, the data is checked for duplicates by the client itself and is then sent to the server. At the server side, the server collects all the information from the client and then duplicates are found and removed. Deduplication also has two feature-based methods known as global feature-based method and local feature-based method. There is no detailed investigation done till now to review image data dedu- plication techniques and its characteristics. Few non-standard articles exist ex- plaining data deduplication survey in unstructured way. The authors of [37] have classified the existing data deduplication techniques into two categories as source deduplication and target deduplication. Source deduplication is further classified into file-based and sub-file-based. Sub file is further considered as fixed or vari- able length. Target deduplication is further classified as post-process and inline. They have discussed another way for classifying data deduplication namely off- line and online deduplication. Even though many research articles exist, the paper discusses about only 10 research articles on data deduplication. The authors of [38] discuss 14 deduplication approaches without any taxonomy or relation among them. They have included 24 research articles under these approaches and compare them in terms of scalability, throughput, efficiency, amount of used bandwidth and cost. Many comparison parameters could have been considered along with some more deduplication techniques. Lack of systematic re- view/survey motivated us to do this work. In this survey paper, we survey different types of deduplications or copy de- tection that have been done for images in a systematic and structured way. As im- portant as it is, traditional deduplication mechanisms can only be used if two im- ages have the same bit stream, that is it can only be used if two images are completely the same. It does not apply for an image that has been cropped, ro- tated, or edited out. Automatic methods are now getting more attention with the increase in the redundant information. Also, cloud computing has proved to be very flexible and economical service provider that provide to maintain huge amount of data. In this world of immense data, users normally upload similar im- ages in different storages either due to storage restrictions or network restrictions. The aim of this paper is to understand and observe the different techniques used for image deduplication in terms of functional and performance parameters. Our contributions. Consequently, this significant amount of published re- search on Image deduplication requires some categorization to provide convenient overview of the current state of the art. To this end, we have developed multi- dimensional taxonomy to classify the Image deduplication research based on the properties supported in the research work as described in Section 2. Even though multi-dimensional taxonomy is used popularly for defining image deduplication techniques, we categorize them into two main categories based on whether secu- S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 116 rity is incorporated or not. Non-secure techniques are further categorized based on image type as virtual image or pixel image. Techniques in each category is dis- cussed across a set of functional and performance parameters. The presented tax- onomy allows us to analyze the Image deduplication research trends over time and various features supported in the work. To illustrate the usefulness of the provided classification, we discuss a detailed survey of the collected research arti- cles from extensive databases available online where Image deduplication-based references can be explored according to the designed dimensions and categories of the presented taxonomy. Our specific contributions are as follows: 1) design and discuss multi- dimensional taxonomies for comparison of the various parameters used in image deduplication; 2) explain the image deduplication research trends across two main categories based on whether security is considered or not. Non-secure techniques are further categorized based on image type as virtual image or pixel image; 3) compare the discussed image deduplication schemes in each category in terms of functional and performance parameters. The remaining part of this article is structured in various sections as follows. Section 2 explains the methodology for creating the taxonomy of Image dedupli- cation research work with its dimensions and categories. It also explains the Im- age deduplication-related research articles to analyze and provide trends on the distribution across the proposed dimensions. Section 3 presents a detailed survey of the key research findings and related comparison with respect to a set of func- tional and performance parameters related to Image deduplication. Section 4 ad- dresses the scope of the research on Image deduplication. Finally, conclusions are drawn in Section 5. DESIGN OF IMAGE DEDUPLICATION TAXONOMY AND CLASSIFICATION The taxonomy is aimed at classifying the work carried out in Image deduplication to have an in-depth understanding of the topic. Taxonomy construction varies from topic to topic, but all works in one class given in the taxonomy should be similar in the features or properties. The classification categories should be non- overlapping with well-defined limits between them. The taxonomy designed for Image deduplication related research for analyzing the features and performance includes five dimensions with several non-overlapping categories in each dimen- sion. These include: 1) location of applying deduplication; 2) image feature ex- traction; 3) time of application; 4) image data partitioning strategy; 5) involve- ment of user dataset level. Each dimension consists of a set of categories used to classify the existing image deduplication related articles. The presented taxonomy allows us to analyze the image deduplication research trends over time and various features supported in the work. A given article may not be mutually exclusive to the category as it may belong to one or more categories per dimension. The illustration of image deduplication taxonomy in graphical form is shown in Fig. 2. We have tried to minimize the possible overlap between the existing image deduplication tech- niques as per the proposed dimensions in this early stage of defining the classifi- cation categories. Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 117 The first dimension namely location of deduplication in proposed image de- duplication taxonomy further classifies the existing research works into three categories depending upon the place where deduplication is carried out. Since all the papers are based on client server architecture, cloud being the serving plat- form, the process of deduplication can be executed at the client side, server side or partly in both the places. We categorize the image deduplication techniques with respect to location of deduplication dimension into three classes as explained next: 1) server-side image deduplication: users upload the images to the cloud server in server-side image deduplication category and then the cloud service pro- vider will perform the image deduplication check on its cloud storage to identify whether newly arrived image already exists on the server or not; 2) client-side image deduplication: client will identify the existence of similar data on the cloud server before sending it entirely to the cloud in client-side image deduplication category. Server-side image deduplication reduces computational cost at the client side but with high bandwidth requirements; 3) hybrid location-based image dedu- plication: Image deduplication check may be partially done at client side whereas remaining check will be done at the server side in hybrid image deduplication category of this dimension. Second dimension named as image feature based, as proposed in image de- duplication taxonomy further classifies the existing research works into three categories depending upon the usage of local and global features of images for identification of deduplication. Image features are numerical values extracted from images that are used as discriminating information to distinguish various images or parts of images. Features are extracted for reducing the processing overhead as they are small when compared to image data. Global features of im- age describe the whole image to generalize the image data while local image fea- tures describe small group of pixels in image. Combination of both improves the accuracy of image recognition with the side effect of computational overhead. We categorize the image deduplication techniques in image feature-based dimensions Image feature 1. Global 2. Local 3. Combined Location 1. Client 2. Server 3. Hybrid Involvement of User dataset level 1. Single user 2. Cross user Image data level 1. File level 2. Block level Image Deduplication Dimensions Time of processing 1. Pre 2. Post Fig. 2. Taxonomy of Image Deduplication Techniques S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 118 into three classes namely: 1) global feature-based image deduplication: global features of an image are used to identify the image deduplication; 2) local feature- based image deduplication: local features of an image are used to identify the im- age deduplication; 3) combined feature-based image deduplication: local and global features of an image are used to identify the image deduplication. As per the proposed image deduplication taxonomy, third dimension identi- fied as Time of duplicate removal processing, further classifies the existing re- search works into two categories depending upon the time at which the duplicate data is removed for identified deduplicated images, as explained next: 1) inline image deduplication processing: the identification of deduplication is immediately started when cloud server receives the image without storing it. The deduplicated image/block of image is deleted before storing for achieving unique image data copy; 2) post-image deduplication processing: the received image will be stored in buffer on the cloud server first, then the deduplication check will be performed to identify the duplicate image/block of image. Only the unique images/blocks will be stored on the cloud server database/storage. This dimension namely Time of duplicate removal with respect to virtual image deduplication can be categorized into three categories namely deduplica- tion before backup, deduplication during backup and deduplication after backup. In the first case namely deduplication before backup, duplicate check is done be- fore performing the backup operation so that the size of the data transmitted would be that of the compressed image size. Here, both the fingerprint calculation and index lookup operation must be performed by the host node. In the second case namely deduplication after backup, deduplication check is performed after backing up the image. Since whole image is transmitted, the data transmission size would be large. In this case, storage node is the location for the fingerprint calculation and index lookup operation. The third case namely deduplication op- eration during backup aims at balancing the resource overhead at both the host side and storage side. Fourth dimension named as Image data level in the proposed image dedupli- cation taxonomy further classifies the existing research works into three catego- ries depending upon the whole image or part of image being used for identifica- tion of duplicates. The categories in this dimension are given as follows: 1) file- level image deduplication: the same image existing on the cloud server will be checked using the hash value created for each file based on the specific hash func- tion. If the received image hash value and one of the existing image hash values is same, then the received image will not be stored otherwise it will be stored on cloud server database; 2) block level image deduplication: the received image will be divided into blocks. Hash value is calculated for each block using specific hash function. The hash value for the block is called as block fingerprint. Only one block will be stored on cloud server for two or more blocks with same fingerprint. Otherwise, all blocks are stored on the cloud server; 3) hybrid-level image dedu- plication: both file – level and block level hash are checked for image deduplica- tion check. Fifth dimension named as involvement of user dataset level in proposed im- age deduplication taxonomy further classifies the existing research works into two categories depending upon the usage of user databases being scanned for check- ing identical images. Dataset used for checking image deduplication may belong to specific user or may have permission to store data of multiple users. We cate- Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 119 gorize the image deduplication techniques based on involvement of user dataset level dimensions into two classes namely: 1) single user level image deduplica- tion: image dataset belonging to a user is scanned to check the duplicate images for that user alone; 2) cross-user level image deduplication: Image databases of multiple users are scanned to check the duplicate image. Even though cross user level image deduplication generates higher deduplication ratio and is attractive in terms of storage cost in comparison with single user level image deduplication, it affects the privacy and security concern for users. Even though multi-dimensional taxonomy is used popularly for defining im- age deduplication techniques in the literature in a scattered way, we categorize them into two main categories based on whether security is incorporated or not. Non-secure techniques are further categorized based on image type as virtual im- age or pixel image as shown in Fig. 3. LITERATURE SURVEY ON IMAGE DEDUPLICATION TECHNIQUES This section discusses the two main categories designed in our taxonomy as shown in Fig.3 based on whether the techniques incorporate security or not. Non- secure approaches are further categorized for virtual image or pixel image types. Non-secure image deduplication techniques are described in Section 3.1 and Sec- tion 3.2 for virtual image types and pixel types respectively while Section 3.3 dis- cusses all secure techniques. As the proposed image deduplication techniques are not mutually exclusive to any specific dimension, one technique may belong to one or more dimensions. Image deduplication techniques Non-secure Image deduplication techniques For Virtual Image type For pixel Image type Secure Image deduplication techniques [1] [5] [6] [7] [19] [22] [30] [31] [32] [33] [3] [12] [14] [16] [18] [20] [21] [24] [25] [26] [27] [34] [2] [4] [8] [9] [10] [11] [13] [15] [17] [23] [28] [29] Fig. 3. Proposed Taxonomy for Image Deduplication Techniques S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 120 They are compared in terms of functionality based on the dimensions and per- formance parameters in the respective Section. Table 1 provides the functionality based on the dimensions and performance parameters used for comparison of the image deduplication techniques. NON-SECURE IMAGE DEDUPLICATION FOR VIRTUAL IMAGE TYPE Virtualization is a very important part of cloud computing. It allows multiple servers running on a single host and all disk contents are encapsulated in a single Virtual Machine (VM) Image. But this mechanism can store redundant data and cause storage issues. A lightweight virtual machine image deduplication backup approach in cloud environment is a technique used to eliminate this problem [1]. The process includes dividing the VM image into chunks and checking if the chunk has a fingerprint. The fingerprint is compared with the existing fingerprints in the fingerprint index table and if it exists then it is not entered otherwise the chunk is added in the storage system. The two problems faced are as follows: 1) if the fingerprint index table is long then it will take longer to compare the finger- prints; 2) the process can interrupt the other processes as different virtual ma- chines take the same runtime. This paper gives a classification method to reduce the fingerprint search by converting global duplication to local duplication and to improve index lookup. Two sampling methods are used to find the proper group to perform the deduplication operation in the virtual machine image. A numerical method is also used to calculate the ample space size. Deduplication rate is 10.2%. T a b l e 1 . Comparison parameters for image deduplication techniques SN Functional parameter Remark 1 Matching Algorithm Used Scale invariant feature transform (SIFT), principal com- ponent analysis (PCA) for SIFT, min-hash algorithm, feature extraction, high dimension indexing, accuracy optimization, centroid selection, deduplication evaluator, mean median, standard deviation, hash, map reduce, CRC, pixel based, special layout, visual similarity 2 Location Client side, server side, hybrid 3 Processing time Post-process, inline 4 Feature-based Local, global, structural features, visual model features, and feature points 5 Image data level File, Block,Hybrid 6 Block size Fixed length, variable length 7 Image content type Pixel image (PI) or Virtual Machine Image (VMI) 8 Metadata overhead The extra information required to be stored along with actual image data 9 Optimization objective The goal of the proposed image deduplication technique Cloud environment parameter 10 Cloud/OS - Cloud software If any specific cloud software used 11 Cloud type Public, private and hybrid cloud 12 Number of cloud images Finite number of images existing in the cloud database 13 User level Single user or cross user 14 Security provisioning Protocol Usage of cryptographic protocol, adaptation of crypto- graphic protocol Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 121 An improved k-means clustering method is implemented in Clustering-based acceleration for virtual machine image deduplication in the cloud environment [5]. This is the first paper to have image layout taken into consideration and pro- pose the method of small group merging and periodical triggering to store the vir- tual machine deduplication. Experimental results show the robustness and effi- ciency of this method. Deduplication rate is 89.74%. The number of virtual machine and images grows very rapidly and takes a lot of storage space out of which 90% of the data is redundant. The storage prob- lems caused is studied an Improved Image File Storage Method Using Data De- duplication [6] and discussions about employing deduplication and evaluation. A reference count for image is added to show reliability of image libraries. IM-dedup proposed in [7] transmits the unique blocks of image to the cloud server for reducing transmission time. The kernel file system with deduplication functionality in the image storage helps to manage the duplicated blocks through indexing. Client and server communicate within each other during the process of image deduplication. Deduplication-Enabled P2P based VM Image distribution protocol is intro- duced to speed up the provisioning in the VM [19]. Peer 1 contacts a tracker which sends it the list of its peers which also has an image of file A, it also shows the similarity Matrix between two peers. Scalable read/write throughput in RAID with deduplication capability to Ext4 file system is proposed in [22] as deduplication file system named as ScaleDFS. Parallel processing for fingerprints computation on multiple CPU core improves the write throughput. Deduplication cache improves read throughput and retrieve identical blocks easily. Reduced memory usage cache more finger- print information in memory. Deduplication is focused for single storage partition file system. Authors of [30] have proposed an adaptive deduplication mechanism, which performs fixed length and variable length block-level deduplication for reducing VM disk image file size is used in variable length block-level deduplication is implemented using Rabin–Karp rolling hash algorithm. Multithreading in AKKA framework is used to perform the deduplication and streaming since live migra- tion of VM disk image files is a bulky operation. QuickDedup [31] algorithm is proposed by authors to perform optimal de- duplication of VM disk images by reducing the number of hash computations and comparisons and by storing minimal metadata thereby reducing the overall dedu- plication time. In this approach, a novel byte comparison scheme to create various categories of blocks so that further the QuickDedup algorithm performs the calcu- lation of hashes and their comparisons within the respective categories only. Hence, hash storage space is minimized and comparing within categories speeds up the deduplication of VM disk images much. Authors of [32] propose a highly parallel deduplication cluster (HPDV) which optimizes VM images by considering the foreground quality of VM ser- vices and the background performance of deduplication for VM images. Gener- ally, chunk-based deduplication process involves four sub processes namely chunking, fingerprinting, fingerprint indexing and data storing. Authors of HPDV have parallelized the chunking, fingerprinting tasks which are compute-intensive and fingerprint indexing, which is I/O-intensive task, using the servers in the clus- S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 122 ter. Quality of the foreground VM services is ensured while parallelization is in progress by proposing a resource-aware scheduler (RAS) in this work. Authors of [33] have done an extensive review of several deduplication strategies and come up with a deduplication algorithm for VM images called De- dupCloud. The VM images are divided into blocks and stored sequentially in the preprocessing stage. Then the blocks are categorized based on hashes of blocks derived using SHA-3 hash function. We compare the above discussed non-secure image deduplication techniques for virtual image type in the form of three table. Table 2 gives the comparison in terms of parameters related to algorithms used and perspectives of various dimen- sions in the image deduplication technique. The comparison parameters include algorithm used for deduplication check, location where the deduplication takes place, application of deduplication, usage of image features, granularity of image data level, block size for block level granularity, content of the image, metadata overhead and objective function of the proposed technique. T a b l e 2 . Comparison of non-secure Image Deduplication techniques for virtual image type Pa- per Matching Algorithm Place Proc- essing time Fea- ture- based Data level Block size Metadata over- head Optimization objective 1 Improved k- means clus- tering algo- rithm with fingerprint Hybrid Post Hybrid Block Fixed Fingerprints of fixed size chunks In memory deduplication by using clustering and sampling of fingerprints – fingerprint search space optimization 5 Improved k- means clus- tering with statistical indices Server Post Local Block 4KB Fingerprints of fixed size chunks Reduce the fingerprint search space and im- prove the index lookup performance 6 MD5 index Server Post Local Hybrid Fixed 4/8KB MD5 code of entire image file, size of image file, the number of image blocks, file name, storage address of each image block and storage address of the final block Storage space reduction of VM images 7 MD5 and SHA-1 Client Inline Local Block Fixed 4KB Array of finger- prints and the ref- erence counter Reduction in VMI storage and transmission time 19 Bloom fil- ter’s hash function, Rabin fin- gerprinting scheme Hybrid Post Global Block Vari- able File block id, hash, and control mes- sages Minimize data access or transfer during VMI distribution in data centers 22 a POSIX- compliant, kernel-space driver mod- ule Server 42 VM images of dif- ferent Linux distri- bution Cross Block Both fixed and variable Cryptographic fingerprints of blocks, locality (hash) table that holds the full fin- gerprints and block numbers of the data blocks that correspond to the most recently ac- cessed fingerprint block Scalable read/write throughput in RAID to provide increased capacity, reliability, and performance. for storage Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 123 T a b l e 2 . Comparison of non-secure Image Deduplication techniques for virtual image type Pa- per Matching Algorithm Place Proc- essing time Fea- ture- based Data level Block size Metadata over- head Optimization objective 30 Rabin–Karp rolling hash algorithm Server Inline Cross Block Both fixed and variable Fingerprints of fixed size chunks and thread related metadata Reduction in image stor- age space and total mi- gration time and improvement in deduplication rate 31 SHA-1 Hash based on byte com- parisons, categorizes the blocks Server Post Local block meta- data Block Fixed block numbers, hashes Least number of hashes and comparisons, mini- mum metadata, and fast retrieval of VMs for deployment 32 Parallel fin- gerprint sub- indexes Server Post Global Block Fixed Fingerprints of fixed size chunks and thread related metadata Parallelizing chunking and fingerprinting tasks with multiple threads to speed up the tasks and Superior throughput with minimum interference on the foreground VM services 33 SHA-1 Hash based on byte com- parisons, categorizes the blocks Server Post Local block meta- data Block Vari- able Fingerprints of file chunks Time required for the deduplication of VMI and storage of VMI and metadata The performance analysis environment is discussed in Table 3 in terms of cloud software used, type of cloud, number of images in the cloud dataset, and user level involvement for accessing this database. Table 4 gives the advantages and disadvantages of the corresponding method. T a b l e 3 . Cloud type/ environment Parameters for non-secure Image Deduplica- tion techniques Pa- per Cloud/OS – Cloud software Cloud type Number of cloud im- ages User level 1 VM - Amazon EC2 Private 584 VMI Single 5 Aliyun - largest cloud of China and ISCAS – own cloud Aliyun-Public ISCAS - private Aliyun-Variable ISCAS- 584 Cross 6 Own cloud on a PC Private 11 VMI Cross 7 Openstack Private 35 VMI Cross 19 PeerSim P2P simulator Private Variable- max 30 Cross 22 Openstack Private 102 VMI Cross 30 OpenStack image registry with a standard configuration of 2GB memory and 10GB hard disk in CloudSim simulator Private 4 types of virtual images - VDI, VMDK, VHD, Raw, qcow2 images in total 2,426,552,114 Cross 31 Own configuration on Ubuntu 14.04 (64-bit) Private 10 VMI for Operating system Single 32 Own setup with 9-servers and 16 desktops as clients running Ubuntu 12.10 64 bit with Linux kernel version 3.5.0-17 Private 276 VMI Cross 33 Own configuration on Ubuntu 14.04 (64-bit) Private 10 VMI for Operating Single S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 124 T a b l e 4 . Advantages and disadvantages of non-secure Image Deduplication techniques Pa- per Advantages Disadvantages or Limitation 1 Reduce the virtual machine image deduplication backup time Slight storage space waste. 2 groups can have a high sample hit rate so when done in local deduplication, it can lead to duplicated blocks again 5 Work in both VHD and raw formats; accelerate the backup process Focuses on preprocessing phase than deduplication phase; little increment of disk space usage 6 Deletion rate for image groups which have the same version of operating systems, but different versions of software applications is up about 58% No backup system or Rapid indexing method 7 Uses the memory filter to reduce the overhead of disk index; improves the locality of data by centralizing fingerprints in disk to achieve a higher IO throughput rate with the limited memory occupancy rate Optimization of image download process not clearly given 19 30% performance gain. Image blocks are trades in two swarms. It also deals with hash collisions Lacks real-world environment 22 Parallel deduplication, deduplication cache and reduced memory Cloud platform is distributed environment, but ScaleDFS is single storage partition-based deduplication 30 Very good overall reduction in image storage space and total migration time are achieved when compared with the existing image management systems The reduction in size is dependent on the dataset and the applications running on the VM 31 Reduction in the metadata storage overhead and the number of hash computations thereby a smaller number of comparisons will be made so that overall deduplication time is reduced for the VM disk images Dataset used in not standard 32 Parallelization of compute-intensive chunking and fingerprinting, and the I/O-intensive fingerprint indexing will speed up the deduplication process Setting up of a cluster of deduplication servers is a costly investment 33 DedupCloud minimizes the number of hash value computations and comparisons within similar categories by using byte comparison technique Dataset used in not standard NON-SECURE IMAGE DEDUPLICATION FOR PIXEL IMAGE TYPE A High-precision duplicate image deduplication approach uses the 1-norm of gray block features of images to construct B+ tree index, and then detects the possible similar images by range query [2]. It compares the number of same elements in two images edges information. The fuzzy comprehensive evaluation method is used to select duplicate images by finding the centroid image. The size ratio of deduplicated images and total images is 9.7%. Cloud-scale image compression through content deduplication deals with combating the issues faced with storing storage costs with exponential increase in data [4]. It presents an image compression technique, which takes advantage by compressing each individual image with GIST nearest neighbor to overcome the scalability state-of-art issues. Image deduplications check on massive image file storage that includes dis- tributed database and file system is discussed in [8]. It uses MD5 based signature Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 125 on features of binary image stream instead of file –level or block level fingerprint check. To reduce storage space, the authors of [9] focuses on the Haar wavelet de- composition and Manhattan distance to select image duplication. When the num- ber of same elements between two collections is greater than or equal to the preset threshold t, they considered the two images are duplicate images. Deduplication of electricity bills is done using the content-based image re- trieval with block truncation coding [10]. This is used to categorize pictures of the electricity bills and blocks of images with the same sizes are clustered together. Each cluster is checked for duplicates, and they are a part of a big block. Deduplication image middleware detection comparison in standalone cloud database given in [11; 15] talks about techniques used in image deduplication in a standalone database. Most of the time people pay for more memory due to dupli- cation of images. This paper shows a new framework for the early stages of image deduplication in a cloud service. 11 software taken, which are either use stand- alone or cloud databases. A plugin is used to detect the duplication, which is still a new topic, but mobile Cloud detection has been around from 2008. In all the software used two out of 10 is that you have high detection of duplication and those are hash and Visual similarity. The focus of the paper is to allow users to select Software and Hardware to give them a better use of the cloud services. Large Scale Image Deduplication given in [13] deals with the problem of near Duplicate Image detection. Each duplicate in the database is linked with a Feature representation of it, what is called as a bundle. Two bundles join to form a feature of SPIHT, which is a robust technique, but become slower and gradually less accurate when the data in the database becomes larger. Maximally stable ex- treme regions algorithm is used for clustering as it is told to be better than the KNN means as it can also detect duplication when an image is cropped or rotated. Authors of [17] propose a similar file extraction method where a file with high similarity is extracted. To extract similar files, average hash method is used for determining file similarity. The execution time of deduplication process can be reduced by using only similar files for comparison. Variable length blocks of files are used in this method. The average hash method is used to find the duplica- tion of images. Morphological analysis and cosine similarity is used for the text Duplication. Results show that as the similarity percentage is increased, exact im- age duplicates can be determined. But the time taken for deduplication increases with increase in similarity percentage. Experimental results say that this method is very efficient to shorten the execution time. Recognition built on vocabulary tree with indexing scheme that quantized descriptions from image key points hierarchically, which is used for image simi- larity indication is described in [23]. Indexing descriptor is computed for local regions. The proposed recognition method handles large number of objects for selecting one of them within the acceptable time. Local image descriptors are based on video frames extracted. DBTP [28] i.e., Double Bytes Transport Protocol is used where double chunks are sent by the client to request for deduplication checks simultaneously, and the server responds to the deduplication requests. This scheme helps in miti- gating the side channel’s risk. S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 126 DriveHQ [29] is a website developed to perform efficient image deduplica- tion for optimized photo and video viewing. In this concept, images are divided into rectangular blocks and hash value is generated for each block using MD5.When similar images are uploaded, the images are divided into blocks and each block is checked with the stored hash values. If hash value is similar, then the images are considered as near identical images and not stored in the cloud to conserve space. We compare the above discussed non-secure image deduplication techniques for pixel image type in the form of three table. Table 5 gives the comparison in terms of parameters related to algorithms used and perspectives of various dimen- sions in the image deduplication technique like Table 2. The performance analysis environment is discussed in Table 6in terms like Table 3. Table 7 gives the advan- tages and disadvantages of the corresponding method. T a b l e 5 . Comparison of non-secure Image Deduplication techniques for pixel image type P ap er M at ch in g A lg or it h m P la ce P ro ce ss in g ti m e F ea tu re - b as ed D at a le ve l B lo ck s iz e O pt im iz at io n ob je ct iv e 2 1-norm of gray block fea- tures to construct B trees Server Inline Hybrid Block nxn Image blocks Duplicate images retrieval precision and deduplication accuracy 4 GIST nearest neighbor for compression Server Post Local File NA Image compression rates and reducing computational effort 8 MD5 Server Post Local Binary stream features Fixed Optimization of massive image files storage 9 Manhattan distance and Haar wavelet decomposi- tion Server Inline Local File Fixed size file Higher deduplication ratio, deduplication accuracy 10 Block truncation code Client Post Local Block Variable De-duplication process speed 11 Depend on the existing study technique Server Post Local File NA Storage space reduction 13 SIFT And Maximally Stable Extremal Regions (MSER), Server Post Local File NA Increased deduplication accuracy and performance 15 Deduplication image de- tector software of existing study or plugin for cloud storage Server Post Local File NA High-precision image deduplication 17 Average hash method, File similarity determination Server Post Local Hybrid Fixed/ variable Minimization of execution time for deduplication 23 Local regions indexing descriptors based on visual vocabulary tree Server Inline Local Block Fixed Improvement in retrieval quality 28 Double Bytes Transport Protocol with double chunks simultaneously Client Inline Local Block Fixed Mitigate the side channel’s risk and achieve high bandwidth efficiency of deduplication 29 MD5 Server Post Global Block Fixed Storage optimization NA-Not Applicable Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 127 T a b l e 6 . Cloud type/ environment Parameters for non-secure Image Deduplica- tion techniques for pixel image type Paper Cloud/OS – Cloud software Cloud type Number of cloud images User level 2 Corel image database Private Corel based 1000 PI Single/cross 4 Canonical set Private Dynamic, millions Cross 8 Own dataset Private Variable Single 9 Corel image database and selected images from www.picsearch.com Private 1000 PI Single/cross 10 Own dataset Private Variable Single 11 11 datasets – standard/own Private/ public Variable Single/cross 13 Two dataset – Dataset of [23] for Accu- racy and ILSVRC2010 for Performance Private Accuracy-10200; Performance- 1.2M Cross 15 11 datasets – standard/own Private/ public Variable Single/cross 17 Own cloud on a PC Private 90 bmp images Cross 23 Own setup Public 40000 images of popular music CD’s Cross 28 Python 3.7.6 platform and MySQL database Private Variable Cross 29 Own setup Private optimized photo and video viewing Single T a b l e 7 . Advantages and disadvantages of non-secure Image Deduplication techniques for pixel image type Paper Advantages Disadvantages or Limitation 2 Reduces workload of users. The fuzzy comprehensive evaluation allows the procession of selection of centroid images by visual reference The algorithm is unable to work on images which have been rotated, edited, blurred, or have a watermark excreta 4 Image processing rates reduces the effort used for computation by at least one order of magnitude Ideal Canonical set is not constructed 8 Signature generation and uploading speed is improved and offers an optimization to massive image files storage Massive image file storage distributed database without considering its deficiency 9 The proposed approach can achieve higher deduplication ratio and deduplication accuracy by setting suitable thresholds Methods can’t be used for images with similar structures 10 A single instance of the image in the database avoids confusion Error due to entire data compression at one time 11 Evaluation of existing software is given in detail Pilot test using standalone dataset is performed based on existing image deduplication detector 13 Method can be used even when images are cropped, rotated, or edited Size of visual word affects performance – too small may give false results and too large will be impossible to match in one SIFT mapping 15 Deduplication image detector such as plugin, middleware or software used for deduplication Compares standalone image deduplication detector 17 Both duplication and execution time was reduced Most of discussion is related to text files not images. The time taken for deduplication increases with increase in similarity percentage 23 entropy weighting of the vocabulary tree is defined with video independent of the database. Describes only retrieval process 28 DBTP implements two-side privacy to avoid side channel attack The deduplication ratio is a little reduced compared to existing methods 29 Images are divided into blocks and hashes are generated so that duplication check can use these stored hashes and detect near identical images Does not work for exact duplicates S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 128 SECURE IMAGE DEDUPLICATION Secure image deduplication through image compression in the cloud storages em- beds partial encryption to ensure security against a semi honest CSP and unique hashing to identify identical images into SPIHT compression algorithm [3]. Image compression followed by encryption and hashing in sequence reduces the compu- tational overhead, resources, and metadata to be stored. Authors of [12] discuss how cloud services have had an immense improve- ment in this year in terms of Secure Image Deduplication. Due to this great devel- opment in the services, many people have started storing data, which may also be redundant. Image duplication is necessary to save cost and space. This research has also used encryption called convergent encryption, which is got, by using the Hash Function on the image data the data is encrypted and decrypted with the same keys and hence the duplicates of the image will produce the same cipher text, by recognizing the duplicate of the cipher text, the image duplicates also found. Secure Image Data Deduplication through Compressive Sensing given in [14] presents a scheme by comparing the Compression Sensing (CS) and the SPIHT technique for image deduplication according to their experimental results it is shown that the CS Technique is more efficient and has more security than the other methods. They have also further studied that this technique can be used in video duplication as well since videos also take up a lot of data space in the cloud. The authors of [16] discuss an approach where client side takes an image, compresses it using the SPIHT compression, and partially encrypts it. It also takes the hash value of the image, and the user then uploads only the hash on to the server and the server side checks if the hash value is the same as the previous val- ues. If it is not, then it stores the hash value or it removes the hash to eliminate redundancy. An efficient approach towards image deduplication using Watson proposes a cost-efficient method of image duplication which has proven to reduce the storage of cloud services by one third its uses of WATSON and a MATLAB SSIM algo- rithm to do so [18]. In this technique when the user uploads an image it is sent to the WATSON visual where it image is given a tag and the image with the highest tag is sent to the database and is checked if other tags with the similar name is told. If it is so it is, then sent to the MATLAB SSIM to check if the images are similar or not. If the images already stored in the database, then it will not be stored again in the clients ‘profile in the Cloud Service or image is uploaded on the cloud servers and the user details are updated. Client-Side Secure Image Deduplication of [20] uses a dice protocol, which finds image deduplication in its block level. This research concludes that with all their experimental results images, which are more like each other, have smaller number of blocks in their storage. This however does not show what happens to images, which have been cropped, are in different lighting or have scaling and any other kind of Editing done on them. It also does not deal with file, which has been compressed into different formats. Data outsourcing model of [21] uses file level as well as block level dedupli- cation using Dekey convergent key management scheme. A user computes and sends the block tags to the cloud server, which stores only unique tags. The stored tags are informed to the client so that it can secure it and resend back to server. Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 129 The indexed information of this secured block is maintained at client also for fu- ture access. Secure data deduplicationusing radix trie and bloom filter discussed in [24] used existing hash-based deduplication technique. It starts with convergent en- cryption to avoid leakage of data followed by three stages – authorization dedu- plication using role re-encryption process, proof of ownership and role key up- date. Roles and keys are mapped with radix trie. Data updation and retrieval of ownership verification is done using bloom filter. BDKM [25] is a blockchain based approach to ensure confidentiality of outsourced data and reliability of Convergent Key (CK) management is enhanced by adopting an oblivious pseudorandom function to generate the randomized CK. Data reliability in BDKM is achieved by dividing the CK into segments and distributed to blockchain. This work can be extended by employing blockchain to implement a secure and efficient integrity verification on the data deduplication, where a user can verify the integrity of other users’ data without knowing any information about the data. Multistage for coarse to fine deduplication is proposed in [26]. The global features are comparing to find the duplicate initially followed by local features if no match found. Fine deduplication is applied using SHA 256 based Merkle hash tree. Local and global features work at file level while hash tree works at block level. The database is maintained for each file on dataset consisting of global fea- tures, local features, and hash tree details. Authors of [27] propose an in-line block matching-based data deduplication scheme with dynamic user management. Users encrypt their data using convergent encryption. Server uses in-line block matching protocol to generate unique proof by calculating the group key and re-encrypts the file using the group key. Another user uploading the same file will verify the proof against the server and re-encrypts using a new group key. Contents of the file remains confidential and even the server will not be aware of the file contents. Ownership list is maintained, and access control techniques are employed to prevent the access of cipher text from unauthorized users, cloud servers and adversaries. The analysis of the proposed scheme shows that the computational time, communication, and storage overhead is reduced when compared with the existing deduplication schemes. SEDS [34] scheme is proposed to provide a secure server sided data dedu- plication scheme for storing data in the cloud. This scheme generates constant size ciphertext, which is independent of the number of key servers, and cloud server performs proxy re-encryption to prevent semi-honest proxy server to transform the ciphertext. This scheme supports both intra-Key Server and cross- Key Server duplication check. Experimental analysis proves that the scheme is efficient compared to previous schemes with respect to computation and communication overheads and security. We compare the above discussed secure image deduplication techniques in the form of three tables. Like Table 2 and Table 5, Table 8 gives the comparison in terms of parameters related to algorithms used and perspectives of various di- mensions in the image deduplication technique. Similarly, the performance analy- sis environment is discussed in Table 9 wherein additional parameters are added named as security provisioning protocol used in addition to deduplication tech- niques. Table 10 gives the advantages and disadvantages of the corresponding method. S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 130 DISCUSSION AND CONCLUSION Image deduplication for duplicate check helps to reduce the communication and network transmission cost in the cloud environment. Most of the techniques work towards reduction of algorithm complexity through smaller hash size generation functions. We observed that image deduplication is either possible on pixel-based images or virtual machine images. There is no universal technique, which can be applied on both types of images. We presented taxonomy and classification of existing image deduplication related articles, which clearly shows that image. Table 8. Comparison of Secure Image Deduplication techniques Paper Matching Algorithm Place Process- ing time Fea- ture- based Data level Block size Content type Optimization objective 3 SPIHT compres- sion, partial en- cryption, and hashing Hybrid Inline Global File NA PI Ensure security against a semi honest CSP and compressed image de- duplication 12 attribute-based encryption Hybrid Post Global File NA PI Confidentiality, Privacy protection and com- pleteness 14 CSP - SHA basedduplicate images removal Hybrid Post Global File NA VMI Ensure security against a semi honest CSP and optimize storage space 16 Robust image hashing based on SPIHT Server Inline Local File NA PI Ensures data security against a curious and semi truthful CSP or any malicious user 18 WATSON and MATLAB SSIM algorithm Server Post Local File NA PI Reduction in time re- quired to perform dedu- plication 20 Dual Integrity Convergent En- cryption protocol Client Inline Local Block Fixed- PI Optimal block size determination for hashing and optimize storage space 21 DCT compres- sion and conver- gent encryption Hybrid Inline Local Hy- brid Fixed - 8x8 PI Ensure data security and optimize storage space 24 Radix Trie with Bloom Filter (SDD-RT-BF), hash function Client Inline Local Block Fixed PI/ Au- dio Maximizing deduplica- tion rate and ensuring security 25 Distributed Blockchain, SHA256, RSA Client Inline Global File /Block Fixed PI / text files Achieve secure and reliable Convergent Key management and resistance to the brute- force attack and collu- sion attack launched by the external adversaries 26 Merkle-Hash and Image Features Client Inline Local and global File/B lock Fixed PI Ensure data security and optimize storage space 27 Guillou- Quisquater identi- fication protocol with dynamic ownership man- agement, Conver- gent encryption Client, group Key server Inline — Block Fixed PI / text files Reduce network traffic and storage. Better ownership management 34 Convergent en- cryption and server re- encryption Server Inline — File NA PI / text files Better performance of deduplication algorithm NA- Not Applicable Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 131 T a b l e 9 . Cloud type/ environment Parameters for Secure Image Deduplication techniques Paper Cloud/OS - Cloud software Cloud type Number of cloud images User level Security provisioning Protocol 3 MATLAB environment dataset Private 252 own set Single Partial encryption 12 No specific cloud Any Variable Cross Convergent encryption 14 MATLAB environment dataset Private 6 Cross semi honest CPs 16 MATLAB environment dataset Private 10 PI Single Partial encryption 18 WATSON and MATLAB Private Variable Cross Password Protection 20 Own JAVA based environment Private 30 PI Cross Dual Integrity Convergent Encryption 21 Not given Private Not given Cross Convergent encryption 24 Java on Amazon EC2 serve Public Variable Single SHA-256 with radix trie and bloom filter 25 No specific cloud – own index server Private Variable Cross SHA 256, RSA for encryption 26 No specific cloud – own index server Private Variable Cross SHA 256 for Merkle hash tree 27 Own server Private Not given Cross Convergent encryption 34 Own set up with multiple servers Private Variable Cross Convergent encryption and server re-encryption T a b l e 1 0 . Advantages and disadvantages of secure Image Deduplication techniques Paper Advantages Disadvantages or Limitation 3 Save monumental amounts of computa- tional time and resources; Can find dupli- cate images even when images are ex- tremely similar and compressed Experimentation results are not derived in a real Cloud Service setting 12 In the paper ensure privacy protection and confidentiality The steps done by the client are too many 14 Efficient compression scheme makes the CSP store less data Small set of testing data with small image size 16 Great combination of analysis and security for storage Only same images can be deduplicated. No proof of Storage protocols 18 Cloud storage space usage after deduplica- tion has been reduced up to one third. Cost- effective Image is only removed if user has last access to it or only the image details are hidden from the user 20 Reduce communication and bandwidth cost Lacks real-world environment and diverse image dataset 21 DCT compression reduce storage space small encoding/decoding overhead 24 Client-side deduplication and Tag consis- tency preservation with Fault tolerance other queuing techniques and lightweight crypto- graphic algorithms could be used to improve performance 25 Blockchain ensures data reliability and se- cure key management Secure against the collusion attack with a limited overhead and blockchain can be extended to verify integrity of other users’ data without knowing the details 26 CNN is used to compare global and local features of stored images in the database with that of the incoming image and then additionally Merkle hash is used to check for duplicates Database storage is increased for multiple level comparisons 27 File integrity is achieved by using conver- gent encryption, in-line block matching protocol and group key management that hides file contents from unauthorized users, cloud servers and adversaries Generation of group keys and re encryption for subsequent uploads of the same file by other users is the overhead 34 Ensures data confidentiality, possession proof, resistant against tag inconsistency attack, cross-key server duplication check and scalability The scheme involves multistage key generation and encryption, the process is slower. Cloud server performs proxy re-encryption which is an overhead S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 132 Deduplication has considerable potential towards efficient cloud storage usage. The proposed taxonomy has proved a convenient means of grouping the available image deduplication research and giving insight on its contribution in terms of standard features supported in the image deduplication algorithms, cloud environment, advantages, and limitations. This survey explores published re- search works in greater depth related to the exploitation of features of the tech- nique used for the deduplication check. The existing image deduplication tech- niques neither use standard dataset as benchmark image dataset for performance evaluation nor have standard metric for similarity computation. Authors have their own way to consider performance environment. The optimization objective of the different algorithms is also listed here for researchers to get an overview of the goal of deduplication. The identified drawbacks can be scope for future re- search to work further for strengthen this area. REFERENCES 1. J. Xu, W. Zhang, S. Ye, J. Wei, and T. Huang, “A lightweight virtualmachine image deduplication backup approach in cloud environment,” in 2014 IEEE 38thAnnual Computer Software and Applications Conference, pp. 503–508. 2. M. Chen, S. Wang, and L. Tian, “A High-precision Duplicate Image Deduplication Approach,” JCP, 8(11), pp.2768–2775, 2013. 3. F. Rashid, A. Miri, and I. Woungang, “Secure image deduplication through image compression,” Journal of Information Security and Applications, 27, pp. 54–64, 2016. 4. D. Perra and J.M. Frahm, “Cloud-scale Image Compression Through Content Dedu- plication,” in BMVC, 2014. 5. J. Xu, W. Zhang, Z. Zhang, T. Wang, and T. Huang, “Clustering-based acceleration for virtual machine image deduplication in the cloud environment,” Journal of Sys- tems and Software, 121, pp.144–156, 2016. 6. Z. Lei, Z. Li, Y. Lei, Y. Bi, L. Hu, and W. Shen, “An Improved Image File Storage Method Using Data Deduplication,” in 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 638–643. 7. J. Zhang et al., “IM-Dedup: An image management system based on deduplication applied in DWSNs,” International Journal of Distributed Sensor Networks, 9(7), p.625070, 2013. 8. S. Youjun and Z. Daxing, “Research on deduplication technology for massive image file storage,” Computer Applications and Software, 4, p. 15, 2014. 9. M. Chen, Y. Wang, X. Zou, S. Wang, and G. Wu, “A duplicate image deduplication approach via Haar wavelet technology,” in 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems, vol. 2, pp. 624–628). 10. A.J. Zargar, N. Singh, G. Rathee, and A.K. Singh, “Image data-deduplication using the block truncation coding technique,” in 2015 International Conference on Futur- istic Trends on Computational Analysis and Knowledge Management (ABLAZE), IEEE, pp. 154–158. 11. N. Yusof, A. Ismail, and N.A.A. Majid, Deduplication image middleware detection comparison in standalone cloud database. 12. H. Gang, H. Yan, and L. Xu, “Secure image deduplication in cloud storage,” in In- formation and Communication Technology-EurAsia Conference, pp. 243–251. Springer, Cham, 2015. 13. T.Y. Wen, Large Scale Image Deduplication. Available: http://vision.stanford.edu/ teaching/cs231a_autumn1213_internal/project/final/writeup/nondistributable/Wen_ Paper.pdf 14. F. Rashid and A. Miri, “Secure image data deduplication through compressive sens- ing,” in 2016 14th Annual Conference on Privacy, Security and Trust (PST), IEEE, pp. 569–572. Survey of image deduplication for cloud storage Системні дослідження та інформаційні технології, 2023, № 4 133 15. N. Yusof, N.A.A. Majid, and A. Ismail, “Framework deduplication image detection assisted multimedia system using multi technique,” in 2016 6th International Work- shop on Computer Science and Engineering, WCSE 2016, pp. 402–406. 16. S.P. Bini and S. Abirami, “Secure image deduplication using SPIHT compression,” in 2017 International Conference on Communication and Signal Processing (ICCSP), IEEE, pp. 0276–0280. 17. T. Koike, M.Z. Nurshafiqah, and T. Kinoshita, “Data Deduplication for Similar Im- age Files,” in Proceedings of the International Conference on Parallel and Distrib- uted Processing Techniques and Applications (PDPTA), pp. 296–301, 2018. 18. R. Aathira and V.P. Poonthottam, “An efficient approach towards image deduplica- tion using WATSON,” in 2017 International Conference on Inventive Computing and Informatics (ICICI), IEEE, pp. 180–183. 19. C. Lee, S. Kim, and E. Kim, “A Deduplication-Enabled P2P Protocol for VM ImageDistribution,” IEICE TRANSACTIONS on Information and Systems, 98(5), pp. 1108–1111, 2015. 20. A. Agarwala, P. Singh, and P.K. Atrey, “Client Side Secure Image Deduplication Using DICE Protocol,” in 2018 IEEE Conference on Multimedia Information Proc- essing and Retrieval (MIPR), IEEE, pp. 412–417. 21. M.S. Soofiya and S.V. Kumar, DCT Image Compression and Secure Deduplication with Efficient Convergent Key Management. 22. M. Ma, Kernel-space Inline Deduplication File Systems for Virtual Machine Image Storage; Doctoral dissertation, Chinese University of Hong Kong, 2013. 23. D. Nistr and H. Stewnius, “Scalable recognition with a vocabulary tree,” in IN CVPR, pp. 2161–2168, 2006. 24. S.E. Ebinazer and N. Savarimuthu, “An efficient secure data deduplication method using radix trie with bloom filter (SDD-RT-BF) in cloud environment,” Peer-to-Peer Networking and Applications, 14(4), pp. 2443–2451, 2021. 25. G. Zhang, H. Xie, Z. Yang, X. Tao, and W. Liu, “BDKM: A blockchain-based secure deduplication scheme with reliable key management,” Neural Processing Letters, pp. 1–18, 2021. 26. D.P. Akarsha, S. Chaudhari, and R. Apama, “Coarse-to-Fine Secure Image Dedupli- cation with Merkle-Hash and Image Features for Cloud Storage,” in 2021 Asian Conference on Innovation in Technology (ASIANCON), IEEE, pp. 1–6. 27. V. Kanagamani and M. Karuppiah, “Zero knowledge-based data deduplication using in-line Block Matching protocol for secure cloud storage,” Turkish Journal of Elec- trical Engineering & Computer Sciences, 29(4), pp. 2067–2083, 2021. 28. J. Ouyang, H. Zhang, H. Hu, X. Wei, and D. Dai, “Enhanced Deduplication Protocol for Side Channel in Cloud Storages,” International Journal of Network Security, 23(2), pp. 270–277, 2021. 29. S. Vinoth Kumar, L. Kruthika, K. Pooja, H.J. Priyanka, and N.R. Rachana, “Image Deduplication in DriveHQ Cloud,” Journal of Computational and Theoretical Nanoscience, 17(9-10), pp. 3895–3898, 2020. 30. N.M. Tyj and G. Vadivu, “Adaptive deduplication of virtual machine images using AKKA stream to accelerate live migration process in cloud environment,” Journal of Cloud Computing, 8(1), pp. 1–12, 2019. 31. S. Saharan, G. Somani, G. Gupta, R. Verma, M.S. Gaur, and R. Buyya, “QuickDe- dup: Efficient VM deduplication in cloud computing environments,” Journal of Par- allel and Distributed Computing, 139, pp. 18–31, 2020. 32. C. Lin, Q. Cao, J. Huang, J. Yao, X. Li, and C. Xie, “HPDV: A highly parallel dedu- plication cluster for virtual machine images,” in 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), IEEE, pp. 472–481. 33. S.S. Patra, S. Jena, J.R. Mohanty, and M.K. Gourisaria, “DedupCloud: an optimized efficient virtual machine deduplication algorithm in cloud computing environment,” Data Deduplication Approaches: Concepts, Strategies, and Challenges, 281, 2020. 34. S.K. Nayak and S. Tripathy, “SEDS: secure and efficient server-aided data dedupli- cation scheme for cloud storage,” International Journal of Information Security, 19(2), pp. 229–240, 2020. S. Chaudhari, R. Aparna ISSN 1681–6048 System Research & Information Technologies, 2023, № 4 134 35. D. Reinsel, J. Gantz, and J. Rydning, “Data Age 2025: The Evolution of Data to Life-Critical,” Seagate, an IDC White Paper 2017. 36. Q. He, Z. Li, and X. Zhang, “Data deduplication techniques,” in 2010 International Conference on Future Information Technology and Management Engineering (FITME), pp. 430–433. 37. Kirti Ashok Tayade and G.S. Malande, “Survey paper on a secure and authorized de- duplication scheme using hybrid cloud approach for multimedia data,” in 2017 In- ternational Conference on Energy, Communication, Data Analytics and Soft Com- puting (ICECDS), IEEE, pp. 2966–2969. 38. Shieh Fatemeh, Mostafa Ghobaei Arani, and Mahboubeh Shamsi, “De-duplication approaches in cloud computing environment: a survey,” International Journal of Computer Applications, 120, no. 13, 2015. 39. W. Xia et al., “A comprehensive study of the past, present, and future of data dedu- plication,” Proceedings of the IEEE, vol. 104, pp. 1681–1710, 2016. 40. “Data deduplication in the cloud explained, part one,” ComputerWorld. Accessed on: Dec 1, 2021. [Online]. Available: https://www.computerworld.com/article/ 2474479/data-deduplication-in-the-cloud-explained--part-one.html 41. “Data deduplication in the cloud explained, part two: the deep dive,” Computer- World. Accessed on: Dec 1, 2021. [Online]. Available: https://www.computerworld. com/article/2475106/data-deduplication-in-the-cloud-explained--part-two--the-deep- dive.html Received 21.02.2023 INFORMATION ON THE ARTICLE Shilpa Chaudhari, ORCID: 0000-0001-8659-4214, Ramaiah Institute of Technology, Bangalore, India, e-mail: shilpasc29@msrit.edu Ramalingappa Aparna, ORCID: 0000-0002-8093-916X, Ramaiah Institute of Tech- nology, Bangalore, India, e-mail: aparna@msrit.edu ОГЛЯД ДЕДУПЛІКАЦІЇ ЗОБРАЖЕНЬ ДЛЯ ХМАРНОГО ЗБЕРІГАННЯ / Шілпа Чаудхарі, Рамалінгаппа Апарна Анотація. Посилення комунікацій у реальному житті спонукало до створення, передавання та цифрового зберігання великих обсягів зображень і відеоданих у хмарі. Вибухове збільшення даних віртуальних/візуальних зображень на хмарному сервері потребує ефективного використання сховища, цьому по- сприяє технологія дедуплікації зображень. Незважаючи на те, що властивості віртуального зображення та візуального зображення розрізняються, наявна лі- тература використовує подібний підхід для перевірки дедуплікації, що спону- кало розглянути обидва типи зображень для цього огляду. Дослідження має на меті надати детальний огляд найсучасніших візуальних засобів, а також мето- дів дедуплікації віртуальних зображень у хмарному середовищі, узагальнюю- чи та організовуючи їх шляхом розроблення п’ятивимірної таксономії для ана- лізу функцій і продуктивності з кількома категоріями, що не перетинаються, у кожен вимір. До них належать: 1) місце застосування дедуплікації; 2) виділення ознак зображення; 3) час звернення; 4) стратегія розподілу даних зображення; 5) залучення рівня набору даних користувача. Наявні методи де- дуплікації зображень класифікуються на дві основні категорії залежно від то- го, чи передбачає цей метод захист чи ні. Порівняння методів виконано за на- бором функціональних і продуктивних параметрів. Поточні проблеми висвітлюються з можливими майбутніми напрямами для подальших дослі- джень цієї теми. Ключові слова: дедуплікація зображень, хмарні обчислення, хмарне сховище, виявлення копій зображень.
id journaliasakpiua-article-273996
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2025-07-17T10:28:05Z
publishDate 2023
publisher The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
record_format ojs
resource_txt_mv journaliasakpiua/4b/1914890b913ae5338e301c6dc32cf94b.pdf
spelling journaliasakpiua-article-2739962024-02-01T21:03:07Z Survey of image deduplication for cloud storage Огляд дедуплікації зображень для хмарного зберігання Chaudhari, Shilpa Aparna, Ramalingappa дедуплікація зображень хмарні обчислення хмарне сховище виявлення копій зображень image deduplication cloud computing cloud storage image copy detection Increased growth of real-life communication has motivated the creation, transmission, and digital storage of vast volumes of images and video data on the cloud. The explosive increase in virtual/visual image data on cloud servers requires efficient storage utilization that can be addressed using image deduplication technology. Even though the virtual and visual image properties are different, the existing literature uses a similar approach for deduplication checks, which motivated us to consider both image types for this review. This article aims to provide a detailed survey of state-of-the-art visuals as well as virtual image deduplication techniques in a cloud environment, summarizing and organizing them by developing a five-dimensional taxonomy for analysing the features and performance with several non-overlapping categories in each dimension. These include: 1) location of applying deduplication; 2) image feature extraction; 3) time of application; 4) image data partitioning strategy; 5) involvement of user dataset level. Existing image deduplication techniques are categorized into two main categories based on whether the technique involves security. A comparison of techniques is discussed across a set of functional and performance parameters. The current issues are highlighted with the possible future directions to motivate further research studies on the topic. Посилення комунікацій у реальному житті спонукало до створення, передавання та цифрового зберігання великих обсягів зображень і відеоданих у хмарі. Вибухове збільшення даних віртуальних/візуальних зображень на хмарному сервері потребує ефективного використання сховища, цьому посприяє технологія дедуплікації зображень. Незважаючи на те, що властивості віртуального зображення та візуального зображення розрізняються, наявна література використовує подібний підхід для перевірки дедуплікації, що спонукало розглянути обидва типи зображень для цього огляду. Дослідження має на меті надати детальний огляд найсучасніших візуальних засобів, а також методів дедуплікації віртуальних зображень у хмарному середовищі, узагальнюючи та організовуючи їх шляхом розроблення п’ятивимірної таксономії для аналізу функцій і продуктивності з кількома категоріями, що не перетинаються, у кожен вимір. До них належать: 1) місце застосування дедуплікації; 2) виділення ознак зображення; 3) час звернення; 4) стратегія розподілу даних зображення; 5) залучення рівня набору даних користувача. Наявні методи дедуплікації зображень класифікуються на дві основні категорії залежно від того, чи передбачає цей метод захист чи ні. Порівняння методів виконано за набором функціональних і продуктивних параметрів. Поточні проблеми висвітлюються з можливими майбутніми напрямами для подальших досліджень цієї теми. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023-12-26 Article Article Peer-reviewed Article application/pdf https://journal.iasa.kpi.ua/article/view/273996 10.20535/SRIT.2308-8893.2023.4.09 System research and information technologies; No. 4 (2023); 113-134 Системные исследования и информационные технологии; № 4 (2023); 113-134 Системні дослідження та інформаційні технології; № 4 (2023); 113-134 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/273996/290680
spellingShingle дедуплікація зображень
хмарні обчислення
хмарне сховище
виявлення копій зображень
Chaudhari, Shilpa
Aparna, Ramalingappa
Огляд дедуплікації зображень для хмарного зберігання
title Огляд дедуплікації зображень для хмарного зберігання
title_alt Survey of image deduplication for cloud storage
title_full Огляд дедуплікації зображень для хмарного зберігання
title_fullStr Огляд дедуплікації зображень для хмарного зберігання
title_full_unstemmed Огляд дедуплікації зображень для хмарного зберігання
title_short Огляд дедуплікації зображень для хмарного зберігання
title_sort огляд дедуплікації зображень для хмарного зберігання
topic дедуплікація зображень
хмарні обчислення
хмарне сховище
виявлення копій зображень
topic_facet дедуплікація зображень
хмарні обчислення
хмарне сховище
виявлення копій зображень
image deduplication
cloud computing
cloud storage
image copy detection
url https://journal.iasa.kpi.ua/article/view/273996
work_keys_str_mv AT chaudharishilpa surveyofimagededuplicationforcloudstorage
AT aparnaramalingappa surveyofimagededuplicationforcloudstorage
AT chaudharishilpa oglâddeduplíkacíízobraženʹdlâhmarnogozberígannâ
AT aparnaramalingappa oglâddeduplíkacíízobraženʹdlâhmarnogozberígannâ