Project Title: Availability Aware Distributed Data Deduplication
Problem Statement:
In this project, we aim to reduce the resources, such as storage space and disk I/O operations, that cloud vendors use to store and manage large volumes of data. We also aim to provide a storage environment that is highly available and reliable.
Idea/Abstract:
The number of cloud storage users is increasing day by day, and the volume of stored data is growing rapidly with it. Much of this data is duplicate, since two or more users may upload the same data (for example, files or videos shared by people on social networking apps). In addition, to make the storage system reliable and highly available, cloud storage vendors create redundant copies of the uploaded data through replication. All of this data has to be stored in a distributed environment made up of a group of servers. To address these issues efficiently, we propose a deduplication strategy for the distributed environment that provides reliability through replication and removes duplicate data through duplicate detection. We present a versatile and practical primary storage deduplication platform suitable for both replication and deduplication. To achieve this, we have developed a new in-memory data structure that efficiently detects duplicate data and also keeps track of replication.
Data Structure Used in this Project:
A linked list combined with hashing serves as our in-memory data structure, and the SHA algorithm is used for deduplication and replication.
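The project's actual data structure is not shown here, so the following is only a minimal sketch of one way a hash table with linked-list buckets and SHA fingerprints could detect duplicates and track replicas. The names `Node`, `DedupIndex`, the server names `s1` to `s3`, the replication factor of 2, and the replica placement rule are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One linked-list entry: a unique piece of data and where its replicas live."""
    digest: str                     # SHA-256 fingerprint of the data
    replicas: List[str]             # hypothetical names of servers holding a copy
    ref_count: int = 1              # number of uploads that refer to this data
    next: Optional["Node"] = None   # next entry in the same bucket


class DedupIndex:
    """Hash table whose buckets are singly linked lists (separate chaining)."""

    def __init__(self, num_buckets=1024, servers=("s1", "s2", "s3"),
                 replication_factor=2):
        # Assumption: replication_factor is not larger than the number of servers.
        self.buckets: List[Optional[Node]] = [None] * num_buckets
        self.servers = list(servers)
        self.replication_factor = replication_factor

    def _bucket(self, digest: str) -> int:
        return int(digest, 16) % len(self.buckets)

    def _find(self, digest: str) -> Optional[Node]:
        node = self.buckets[self._bucket(digest)]
        while node is not None:          # walk the linked list of this bucket
            if node.digest == digest:
                return node
            node = node.next
        return None

    def add(self, data: bytes) -> Tuple[Node, bool]:
        """Return (entry, was_duplicate) for the uploaded data."""
        digest = hashlib.sha256(data).hexdigest()
        node = self._find(digest)
        if node is not None:             # duplicate: only count the reference
            node.ref_count += 1
            return node, True
        # New data: pick consecutive servers, starting from one derived
        # from the digest (an illustrative placement rule, not the project's).
        start = int(digest, 16) % len(self.servers)
        replicas = [self.servers[(start + k) % len(self.servers)]
                    for k in range(self.replication_factor)]
        i = self._bucket(digest)
        node = Node(digest, replicas, next=self.buckets[i])  # insert at list head
        self.buckets[i] = node
        return node, False
```

In this sketch, a second upload of the same bytes finds the existing node and increments `ref_count` instead of creating new replicas, while a first upload is assigned to `replication_factor` distinct servers.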
What is data de-duplication?
Data deduplication is a technique for eliminating redundant data in a data set. During deduplication, extra copies of the same data are deleted, leaving only one copy to be stored. The data is analysed to identify duplicate byte patterns, so that only a single instance of each duplicated part is stored on the server.
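As a small illustration of the idea, the sketch below keeps one stored copy per distinct content and lets several file names point to it. The names `store`, `catalog`, `upload`, and `download` are hypothetical and chosen only for this example; SHA-256 is used as the fingerprint.

```python
import hashlib

store = {}    # digest -> the single stored copy of the data
catalog = {}  # file name -> digest of its content


def upload(name: str, data: bytes) -> bool:
    """Register a file; return True only if a new copy had to be stored."""
    digest = hashlib.sha256(data).hexdigest()
    is_new = digest not in store
    if is_new:
        store[digest] = data      # first time this content is seen
    catalog[name] = digest        # every name points to the shared copy
    return is_new


def download(name: str) -> bytes:
    """Fetch a file's content through its digest."""
    return store[catalog[name]]
```

If two users upload the same video under different names, `store` holds a single copy and both names in `catalog` resolve to it.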
Why data de-duplication?
- It reduces the amount of storage needed for a given set of files.
- It reduces costs and increases space efficiency in the distributed storage environment.
- It reduces disk I/O operations.
De-duplication by Processing Time:
- Post-process
- Inline process
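Post-process and inline deduplication differ in when duplicate detection runs: inline checks the data before it is written, while post-process writes everything first and removes duplicates in a later pass. The sketch below contrasts the two under simplifying assumptions; the function names and the use of a plain dictionary and list as "storage" are illustrative only.

```python
import hashlib


def inline_write(store: dict, data: bytes) -> str:
    """Inline: fingerprint the data first, so a duplicate never reaches storage."""
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)   # stored only if the digest is new
    return digest


def post_process(raw_writes: list) -> dict:
    """Post-process: data was already written as-is; a later pass keeps one copy each."""
    store = {}
    for data in raw_writes:
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return store


uploads = [b"report", b"photo", b"report"]

inline_store = {}
for item in uploads:
    inline_write(inline_store, item)   # never holds more than the unique items

raw_storage = list(uploads)            # post-process: all three writes land first
deduplicated = post_process(raw_storage)
```

Both paths end with the same two unique items, but the post-process path temporarily needs room for all three writes, whereas the inline path pays the hashing cost on every write.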
De-duplication by Granularity:
- File-level Deduplication
- Block-level Deduplication
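The difference between the two granularities can be shown with a short sketch: file-level deduplication fingerprints whole files, while block-level deduplication fingerprints pieces of each file. Fixed-size blocks and the tiny block size of 4 bytes are assumptions made to keep the example readable; the function names are hypothetical.

```python
import hashlib


def file_level_units(files):
    """Map digest -> size, with one entry per distinct whole file."""
    return {hashlib.sha256(f).hexdigest(): len(f) for f in files}


def block_level_units(files, block_size=4):
    """Map digest -> size, with one entry per distinct fixed-size block."""
    units = {}
    for f in files:
        for i in range(0, len(f), block_size):
            block = f[i:i + block_size]
            units[hashlib.sha256(block).hexdigest()] = len(block)
    return units


def stored_bytes(units):
    """Total bytes kept after deduplication."""
    return sum(units.values())


# Two files that share their first eight bytes but differ at the end.
files = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]
file_level = stored_bytes(file_level_units(files))    # 24: the files differ as wholes
block_level = stored_bytes(block_level_units(files))  # 16: AAAA and BBBB stored once
```

Here file-level deduplication saves nothing because the files are not identical, while block-level deduplication stores the shared blocks once, at the price of a larger index with one entry per block.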
Hash Algorithms for Duplicate Detection:
- MD5
- SHA256
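Both algorithms turn data into a fixed-size fingerprint that can be compared instead of the data itself. MD5 produces a 128-bit digest and SHA-256 a 256-bit digest; MD5 is faster to compute but has known practical collisions, which is one reason a deduplication system might prefer SHA-256. The sketch below uses Python's standard `hashlib`; the helper name `fingerprint` is an assumption for illustration.

```python
import hashlib


def fingerprint(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


sample = b"the same bytes uploaded by two users"

md5_print = fingerprint(sample, "md5")        # 32 hex characters (128 bits)
sha256_print = fingerprint(sample, "sha256")  # 64 hex characters (256 bits)

# Identical content always yields the same fingerprint, which is what
# makes it usable as a duplicate-detection key.
is_duplicate = fingerprint(sample) == fingerprint(bytes(sample))
```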
Future Scope:
- Load Balancing: multiple main servers have to be created to balance the network traffic.
- In-Memory Hash Table: if the main server fails or reboots, the in-memory data structure is lost and the whole system crashes. To solve this, persistent storage is needed that takes a snapshot of the whole in-memory data structure immediately after every update.
- Support for file system commands like ls, chmod, chown.
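The snapshot idea from the in-memory hash table item above could look roughly like the sketch below, which rewrites a snapshot file after every update and reloads it on start. The class name `PersistentIndex`, the JSON format, and the file layout are assumptions, not the project's design; writing to a temporary file and renaming it with `os.replace` is one common way to avoid leaving a half-written snapshot behind.

```python
import json
import os
import tempfile


class PersistentIndex:
    """In-memory table (digest -> replica list) snapshotted to disk on every update."""

    def __init__(self, path: str):
        self.path = path
        self.table = {}
        if os.path.exists(path):           # recover state after a crash or reboot
            with open(path, "r", encoding="utf-8") as f:
                self.table = json.load(f)

    def put(self, digest: str, replicas) -> None:
        self.table[digest] = list(replicas)
        self._snapshot()                   # persist immediately after the update

    def _snapshot(self) -> None:
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.table, f)
            f.flush()
            os.fsync(f.fileno())           # push the bytes to disk before renaming
        os.replace(tmp_path, self.path)    # swap in the complete snapshot
```

Taking a full snapshot on every update is simple but becomes costly as the table grows; appending changes to a log and snapshotting periodically is a common alternative, though it is not something the project description specifies.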
Team Members:
- Prashant Sonsale (7276176311, prashantsonsale96@gmail.com)
- Anand Fakatkar (8237516939, aanandf@gmail.com)
- Nishant Agrawal (9921822904, nishant.agarwal050@gmail.com)
- Aditya Khowla (9762977289, aditya.khowala51295@gmail.com)