VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Santhibastawad Road, Machhe
Belagavi - 590018, Karnataka, India

PROJECT WORK PHASE-1 (18CSP77) REPORT


ON
“Efficient Fault-Tolerant Data Recovery with Optimized Cauchy Coding”
Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING
For the Academic Year 2023-2024
Submitted by

Adarsh 1JS20IS005
Prajwal Bhat KS 1JS20IS070
Pratheek R 1JS20IS075
Shrivara S Samaga 1JS20IS102

Under the Guidance of


Mrs. Sudha P R
Assistant Professor, Dept. of ISE, JSSATE

2023-2024
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING
JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr. Vishnuvardhan Road, Bengaluru-560060
JSS MAHAVIDYAPEETHA, MYSURU
JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr. Vishnuvardhan Road, Bengaluru-560060

DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the Project Work Phase - 1 (18CSP77) Report entitled “Efficient Fault-
Tolerant Data Recovery with Optimized Cauchy Coding” is a bonafide work carried out
by Adarsh [1JS20IS005], Prajwal Bhat KS [1JS20IS070], Pratheek R [1JS20IS075],
and Shrivara S Samaga [1JS20IS102] in partial fulfillment of the requirements for the award
of the degree of Bachelor of Engineering in Information Science and Engineering of
Visvesvaraya Technological University, Belagavi, during the year 2023-2024.

Signature of the Guide Signature of the HOD

Mrs. Sudha P R Dr. Rekha P M


Assistant Professor Professor & Head
Dept. of ISE Dept. of ISE
JSSATE, Bengaluru JSSATE, Bengaluru
ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of
any task would be incomplete without the mention of the people who made it
possible. So, with gratitude, we acknowledge all those whose guidance and
encouragement crowned our efforts with success.

First and foremost, we would like to thank His Holiness Jagadguru Sri
Shivarathri Deshikendra Mahaswamiji and Dr. Bhimasen Soragaon, Principal,
JSSATE, Bangalore, for providing an opportunity to carry out the Project Work
Phase - 1 (18CSP77) as a part of our curriculum in partial fulfillment of the degree
course.

We express our sincere gratitude to our beloved Head of the Department, Dr.
Rekha P M, for her co-operation and encouragement at every stage of our
work.

It is our pleasant duty to place on record our deepest sense of gratitude to our
respected guide, Mrs. Sudha P R, Assistant Professor, for the constant
encouragement, valuable help, and assistance in every possible way.

We are thankful to the Project Coordinators, Dr. Nagamani N P, Associate
Professor, and Mrs. Sahana V, Assistant Professor, for their continuous co-operation
and support.

We would like to thank all the teachers and non-teaching staff of the ISE
department for providing us with their valuable guidance and for being there at all
stages of our work.
Adarsh [1JS20IS005]
Prajwal Bhat KS [1JS20IS070]
Pratheek R [1JS20IS075]
Shrivara S Samaga [1JS20IS102]
ABSTRACT

In the professional realm, the profound influence of big data is undeniably reshaping industries.
This transformative shift stems from the burgeoning data streams generated by an array of sensors
embedded in smart devices. Consequently, the imperative for a fault-tolerant data storage and
retrieval framework becomes evident. The specter of data loss, whether precipitated by natural
disasters, human lapses, or hardware malfunctions, looms large. Furthermore, an assortment of
security threats and data corruption attacks lurk, seeking to undermine storage disks and result in
partial or complete data loss.

This research endeavor is dedicated to advancing data encoding and recovery mechanisms. It
introduces an innovative approach known as Optimized Cauchy Coding (OCC), which leverages
matrix heuristics to create a set of matrices. The OCC methodology harnesses the Cauchy matrix
as a generator matrix within the framework of Reed Solomon (RS) code. This strategic choice
leads to more efficient data block encoding, characterized by a reduction in the number of XOR
operations and a subsequent decrease in the time complexity of the encoding algorithm.

The OCC system is designed to ensure data availability in the face of disk failures. In the event of
such an occurrence, the codeword can be utilized to retrieve missing data from any data block. In
terms of data recovery, the OCC approach demonstrates superior performance when compared to
the Optimal Weakly Secure Minimum Storage Regenerating (OWSPM-MSR) and Product-Matrix
Minimum Storage Regenerating (PM-MSR) methods.
TABLE OF CONTENTS

Chapter 1: Introduction
Chapter 2: Literature Survey
Chapter 3: Problem Identification
Chapter 4: Objectives
Chapter 5: Methodology
    5.1 Use Case Diagram
    5.2 Activity Diagram
    5.3 Data Flow Diagram
        5.3.1 Encoding Process
        5.3.2 Decoding Process
    5.4 Sequence Diagram
Chapter 6: System Requirements and Specification
    6.1 Hardware Requirements
    6.2 Software Requirements
Chapter 7: Expected Outcome of the Proposed Project
References
TABLE OF FIGURES

Fig 5.1.1 Use case diagram: end user
Fig 5.1.2 Use case diagram: cloud server
Fig 5.2 Activity diagram
Fig 5.3.1 Store procedure level-0
Fig 5.3.2 Download procedure level-1
Fig 5.4 Sequence diagram

Chapter 1

INTRODUCTION

Nowadays, social media is the primary source of the data generation referred to as "big data," and
it generates tremendous amounts of data that are both enormous in size and intricate in nature. As a
result of the current pandemic crisis, the use of the internet and smartphones is increasing every
day. Consequently, numerous businesses generate a high volume of records in the Petabyte or Exabyte
range. Traditional systems are limited in their ability to capture, store, and analyse such large
amounts of data, according to Wang et al.

Large datasets can now be stored, compiled, and evaluated because of advancements in processing
and storage capabilities. Big data analytics offers novel approaches to analysing massive datasets.
Such applications are found in domains including healthcare, finance, traffic management,
education, and retail.

Furthermore, the increase in data necessitates attention to several difficulties such as privacy,
integrity, and access control, all of which are required to protect data from various attacks such as
data degradation attacks and man-in-the-middle attacks. The current research uses a Cauchy matrix
generation method to build models that can be used to manage data recovery for
dispersed datasets.


Chapter 2

LITERATURE SURVEY

[1] The escalating volume of big data is tackled through the introduction of an Optimized Cauchy
Coding method augmented with Reed Solomon (RS) coding. The imperative motivation behind
this study is to address the exigency for fault-tolerant data storage and retrieval, particularly in the
context of natural disasters, human errors, or mechanical failures. The OCC method seeks to
ameliorate the encoding process by curtailing read/write operations, diminishing XOR operations
during data recovery, and mitigating resource utilization challenges in distributed environments.
The study systematically delves into the landscape of related work, encompassing diverse
methodologies such as the Cauchy matrix generation method, multi-cloud storage systems,
distributed data storage systems, and erasure-resilient codes. A robust comparative analysis is
presented, juxtaposing the OCC approach against established methods like Optimal Weakly
Secure Minimum Storage Regenerating (OWSPM-MSR) and Product-Matrix Minimum Storage
Regenerating (PM-MSR). The juxtaposition emphasizes the OCC's distinct advantages,
particularly its prowess in reducing encoding and decoding times.

The primary research contribution lies in the conception and development of the Optimized
Cauchy Coding (OCC) technique. Grounded in Reed Solomon (RS) coding, this method
strategically addresses the reduction of read/write operations during encoding, minimization of
XOR operations in data recovery, and the efficient handling of resource utilization issues within
distributed environments. Methodologically, it elucidates the encoding and decoding procedure
involving key elements such as clients, data files, data blocks, parity blocks, and server nodes. The
OCC technique intricately examines redundancy configurations during encoding, delicately
navigating the trade-off between matrix creation time and resource utilization in the distributed
setting. The paper provides an exhaustive exploration of fault-tolerant data storage and retrieval within
distributed systems. It unveils the Optimized Cauchy Coding (OCC) approach, leveraging Reed
Solomon (RS) coding, and meticulously details the methodology employed in its development.
The study underscores the significance of OCC in advancing the efficacy of data encoding and
recovery in the realm of distributed storage systems.


[2] In the design of distributed file systems aimed at ensuring data availability and stability, diverse
techniques have been employed. Traditionally, storage methods in distributed file systems leaned
towards replication-based approaches; however, a shift has occurred towards the more
contemporary use of erasure coding (EC) techniques, primarily driven by concerns related to space
efficiency. While EC techniques offer substantial improvements in addressing space efficiency
challenges compared to replication, they are not without their own set of performance degradation
factors. These factors include complications in the encoding and decoding processes, as well as
issues related to input and output (I/O) efficiency. To tackle these challenges, this study introduces
a novel approach known as the buffering and combining technique within EC-based distributed
file systems. This technique revolves around the consolidation of various I/O requests that arise
during the encoding process, combining them into a single operation for more efficient processing.
By streamlining the encoding phase through the aggregation of multiple requests, this technique
seeks to alleviate performance issues inherent in EC-based systems.

Furthermore, the study proposes four distinct recovery measures to address the complications
arising during the decoding phase. The first measure focuses on the even distribution of disk
input/output loads, preventing potential bottlenecks and enhancing the overall efficiency of the
decoding process. The second introduces a random block layout strategy, aiming to optimize the
distribution of data blocks and bolster fault tolerance by reducing the impact of correlated failures.
The third measure employs a multi-thread based parallel recovery approach, leveraging parallelism
to concurrently execute decoding tasks. This parallel processing significantly reduces the recovery
time, contributing to improved system performance. Finally, the study introduces the matrix
recycle technique, which intelligently manages and recycles matrices during recovery, minimizing
computational overhead and further enhancing recovery efficiency. In essence, this study
addresses the challenges associated with EC-based distributed file systems comprehensively. The
buffering and combining technique optimizes I/O operations during encoding, while the proposed
recovery measures strategically distribute disk input/output loads during decoding, collectively
aiming to elevate system efficiency, fault tolerance, and overall performance in the context of
erasure coding.


[3] The prevailing utilization of erasure codes in distributed storage systems, characterized by
lower redundancy in comparison to replication methods, has prompted extensive research
primarily focusing on encoding techniques. However, a noticeable gap exists in the research
landscape, with limited studies addressing decoding methods. This paper seeks to bridge this gap
by introducing a novel erasure decoding method that exhibits generality, applicable to both
multivariate finite fields and binary finite fields. The proposed decoding method operates by
leveraging a transformative process involving the decoding transformation matrix, providing a
comprehensive approach to decoding failures. Remarkably, this decoding process is versatile,
accommodating both multivariate finite fields and binary finite fields. Notably, the method
introduces a convenient mechanism to circumvent the challenge of overburdened visiting through
minor adjustments, enhancing its practical applicability.

The theoretical underpinnings of the proposed decoding method are rigorously analyzed,
substantiating its correctness. To validate its efficacy, the paper includes experimental
comparisons with traditional methods. The results of these experiments underscore the superior
decoding efficiency and reduced reconstruction bandwidth offered by the proposed method. This
not only signifies the theoretical soundness of the method but also demonstrates its practical
advantages in comparison to established decoding approaches. In essence, this paper addresses the
underexplored realm of decoding methods in erasure codes for distributed storage systems. The
proposed decoding method stands out for its generality across different finite fields, its
transformative approach to decoding failures, and its demonstrated superiority in decoding
efficiency and reconstruction bandwidth in empirical comparisons. This research contributes not
only to the theoretical understanding of erasure decoding but also provides a practical and efficient
solution to decoding challenges in distributed storage environments.

[4] This paper presents optimization strategies for Reed-Solomon (RS) codes, which are widely
employed in storage systems for data recovery in the event of failures. Conventionally, popular
software implementations rely on parity check matrices, typically utilizing a Cauchy matrix padded
with an identity or a Vandermonde matrix. The encoding complexity of RS codes can be reduced
through various techniques, such as finding a Cauchy matrix with fewer '1's in its bit matrices or
leveraging the Reed-Muller (RM) transform in Vandermonde matrix multiplication. The paper
introduces two novel approaches that surpass previous schemes in terms of efficiency. In the first
approach, diverse constructions of finite fields are explored to further minimize the number of '1's in the bit matrices
of the Cauchy matrix. A new searching method is developed to identify matrices with the minimum
number of '1's, thereby enhancing the encoding complexity reduction.

The second approach innovatively defines RS codes using a parity check matrix in the format of a
Vandermonde matrix concatenated with an identity matrix. This format eliminates the
multiplication with the inverse erasure columns during encoding, streamlining the process.
Decoding benefits from simplified formulas. Notably, the Vandermonde matrix in this
unconventional RS code definition necessitates construction using finite field elements in a
non-consecutive order. The paper introduces a modification to enable the application of the RM
transform in this scenario, further reducing matrix multiplication complexity.

For 4-erasure-correcting RS codes over GF(2^8), the two proposed approaches yield substantial
improvements. The first approach enhances encoding throughput by 40%, while the second
approach achieves a 15% improvement on average compared to prior works based on Cauchy
matrix and Vandermonde matrix with RM transform, across various codeword lengths.
Importantly, these advancements extend beyond encoding efficiency; decoding throughput is also
significantly enhanced, contributing to the overall robustness and efficiency of RS codes in storage
systems.

[5] In the context of a distributed storage system, regenerating codes play a crucial role in ensuring
data availability while minimizing repair bandwidth. However, the trade-off involves an increased
risk of data eavesdropping on individual nodes. Previous research in this domain has primarily
focused on providing approximate analyses of security, often relying on information theoretic
security or weak security models. Some studies have further categorized weak security into block
security, albeit limited to the analysis of specific regenerating code schemes, without a
comprehensive examination of optimal block security. This study aims to fill this gap by
conducting a detailed analysis of the block security aspect of a specific regenerating code
scheme—specifically, a Cauchy-matrix-based product-matrix minimum storage regenerating
(MSR) scheme. The focus is on determining the optimal block security within MSR codes. The
analysis involves a thorough investigation into the vulnerabilities and strengths of the scheme
concerning data eavesdropping.


The research takes a step further by proposing an improved MSR code scheme designed to achieve
optimal block security. The proposed scheme addresses the identified vulnerabilities while
maintaining the efficiency of the regenerating code. The study substantiates its findings with
rigorous proofs, providing a robust foundation for the proposed improvements in achieving
optimal block security within the MSR codes. This study contributes to the understanding of
security issues in regenerating codes, specifically focusing on block security within a
Cauchy-matrix-based product-matrix MSR scheme. The research not only analyzes the existing scheme
but also proposes enhancements to achieve optimal block security, supported by detailed proofs
and analysis. This holistic approach provides valuable insights for designing secure and efficient
regenerating codes in distributed storage systems.

[6] Since the introduction of systematic polar codes (SPC) in 2011, various encoding algorithms
have been proposed to enhance their efficiency. However, the optimization of the number of
computing units involved in exclusive OR (XOR) operations has remained an area of
improvement. In response to this, the present study introduces an optimized encoding algorithm
(OEA) for SPC, leveraging the iterative property of the generator matrix and exploiting a specific
lower triangular structure.The proposed OEA aims to reduce the number of XOR computing units
compared to existing non-recursive algorithms. Notably, the iterative property of the generator
matrix and its distinctive lower triangular structure contribute to this reduction. The study provides
a rigorous proof that establishes the extensibility of this property to different code lengths and rates
within polar codes.

Through a meticulous process involving matrix segmentation and transformation, the OEA
achieves a submatrix with a significant proportion of zero elements. This strategic manipulation
of the matrix structure is designed to conserve computational resources. The study demonstrates
that the proportion of zero elements in the matrix can reach up to 58.5% using the OEA for SPC,
particularly when the code length and code rate are set at 2048 and 0.5, respectively. Furthermore,
the benefits of the proposed OEA extend to hardware implementation. In comparison to existing
recursive algorithms where signals are transmitted bidirectionally, the OEA offers advantages that
make it more suitable for efficient hardware realization. The study underscores not only the
theoretical advantages of the OEA in terms of reduced XOR operations but also its practical
implications for hardware implementation, making it a valuable contribution to the optimization
of systematic polar codes.

[7] The optimization of single failure recovery in large-scale storage systems, particularly when
erasure coding is implemented in a cluster file system (CFS), has been a focal point in ongoing
research efforts. This paper contends that existing designs for single failure recovery in such
systems exhibit limitations, including overlooking the bandwidth diversity inherent in CFS
architecture, being tailored to specific erasure code constructions, and lacking specialized
considerations for load balancing during recovery. In response, this study reexamines the single
failure recovery problem within a CFS context and introduces CAR, an innovative
Cross-Rack-Aware Recovery algorithm.

CAR addresses these limitations through several key features. Firstly, it recognizes and leverages
the bandwidth diversity property of the CFS architecture. For each stripe, CAR strategically
identifies a recovery solution that minimizes the need to retrieve data from a minimum number of
racks. Additionally, CAR minimizes cross-rack repair traffic by aggregating intra-rack data before
initiating cross-rack transmissions. Notably, CAR extends its optimization to multi-stripe
recovery, ensuring a balanced distribution of cross-rack repair traffic across multiple
racks. Evaluation results underscore the effectiveness of CAR in significantly reducing both the
volume of cross-rack repair traffic and the resulting recovery time. By offering a comprehensive
solution that considers the specific challenges posed by the CFS architecture, erasure coding, and
load balancing during recovery, CAR represents a substantial advancement in the quest to improve
the performance of single failure recovery in large-scale storage systems. This research not only
addresses current limitations but also paves the way for more efficient and robust recovery
mechanisms in the ever-evolving landscape of cluster file systems.

[8] In the context of distributed storage systems, the imperative need for system scaling arises
from the unprecedented surge in data volume. In these large-scale distributed storage
environments, fault protection becomes a critical consideration. Cauchy Reed-Solomon (CRS)
codes are widely adopted to withstand the impact of multiple simultaneous node failures. This
paper delves into the intricate problem of system scaling in the presence of CRS codes, formulating
it as an optimization model where both the post-scaling encoding matrix and the data migration
policy are assumed to be unknown a priori. To mitigate I/O overhead and optimize the scaling
process, the paper introduces a three-phase optimization scaling scheme tailored for CRS codes.
Initially, the scheme derives the optimal post-scaling encoding matrix under a given data migration
policy. Subsequently, it optimizes the data migration process using the selected post-scaling
encoding matrix. Finally, it leverages the Maximum Distance Separable (MDS) property to further
enhance the efficiency of the designed data migration process. Notably, this scaling scheme is
designed to necessitate minimal data movement while ensuring a uniform distribution of data.
Additionally, it requires reading fewer data blocks compared to conventional minimum data
migration schemes, yet it guarantees the minimum amount of migrated data. To validate the
efficiency of the proposed scaling scheme, the implementation is conducted atop a networked file
system, and extensive experiments are performed. The results highlight that the proposed scheme
not only achieves uniform data distribution and requires minimal data movement but also
outperforms the basic scheme in terms of scaling time. This research contributes a practical and
efficient solution to the challenge of scaling large-scale distributed storage systems, especially
when employing CRS codes, offering valuable insights for optimizing the scaling process in
fault-tolerant storage environments.

[9] In the realm of constructing maximum distance separable (MDS) codes, Vandermonde and
Cauchy matrices stand as common and efficient choices. However, practical scenarios often
demand additional design constraints beyond the MDS requirement, rendering these matrices
inadequate for certain coding problems. This paper addresses such challenges by discussing related
coding problems that emerge in diverse practical settings, where the conventional Vandermonde
or Cauchy matrices fall short. To overcome these limitations, the paper introduces a valuable
technique for tackling constrained coding problems. This involves the strategic use of random
selection in determining the evaluation points of a Vandermonde or a Cauchy matrix. The
incorporation of randomness adds a versatile dimension to code construction, enabling the
fulfillment of specific design constraints. Importantly, the proposed solutions operate effectively
within small finite fields, and their sizes are polynomial in the dimensions of the generator
matrices.


The significance of this technique extends beyond the specific coding problems discussed in the
paper, as it is anticipated to be a valuable tool for solving a broad range of constrained coding
problems. By introducing a flexible and randomized approach to matrix evaluation point selection,
the paper contributes a practical and adaptable solution to challenges encountered in the
construction of MDS codes under additional design constraints. This technique holds promise for
addressing diverse coding scenarios, thereby enhancing the applicability and versatility of code
construction methods in various practical settings.

[10] The advent of erasure codes, particularly Cauchy Reed-Solomon codes, has become
increasingly vital for ensuring fault-tolerance in SSD-based RAID arrays. However, when
implemented on a processor-based RAID controller, erasure coding relies on Galois Field
arithmetic for matrix-vector multiplication. This introduces heightened computational complexity
and results in a substantial number of memory accesses. In response to these challenges, this paper
delves into leveraging Resistive Random-Access Memory (ReRAM) to enhance erasure coding
performance in SSD-based RAID arrays. The proposed solution, termed Re-RAID, integrates
ReRAM as the main memory in both RAID and SSD controllers. Notably, Re-RAID enables the
processing of erasure coding directly on ReRAM, bypassing the traditional reliance on Galois
Field arithmetic. To optimize the encoding process, the paper introduces a novel confluent
Cauchy-Vandermonde matrix as the generator matrix.

One of the key advantages of Re-RAID is its ability to distribute reconstruction tasks for a single
failure to SSDs, empowering SSDs to efficiently recover data with the support of ReRAM
memory. Experimental results affirm the efficacy of this approach, showcasing a remarkable
improvement in both encoding and decoding performance. Specifically, the proposed Re-RAID
system demonstrates enhancements of up to 598× in encoding performance and 251× in decoding
performance, underscoring the transformative impact of ReRAM integration on erasure coding
efficiency in SSD-based RAID arrays. This research presents a noteworthy contribution by
harnessing emerging memory technologies to address the computational challenges associated
with erasure coding, paving the way for more robust and efficient fault-tolerance mechanisms in
storage systems.


[11] The ever-evolving landscape of heterogeneous data generation necessitates an advanced
NoSQL database system that can effectively accommodate this diversity. In the realm of NoSQL
databases, data is typically stored in a distributed manner across globally deployed shards. To meet
the demands of modern data storage, it is imperative that such databases prioritize high availability
while maintaining scalability and partition tolerance. However, a significant challenge in
distributed storage systems is addressing the issue of data skewness, which arises during the
distribution of data items across nodes in the system. To tackle the challenge of data skewness,
this paper proposes a novel approach centered around load balancing in a distributed environment.
The key strategy involves partitioning data into smaller, manageable chunks that can be
independently relocated. This departure from traditional data distribution methods allows for more
granular control over the movement of data within the system, enabling a more nuanced and
efficient approach to load balancing. By partitioning data into smaller units, the system gains the
flexibility to redistribute specific chunks independently, addressing the skewness issue and
optimizing the overall load distribution. This approach holds promise in mitigating the impact of
unevenly distributed data across nodes, ensuring that the distributed storage system remains robust,
scalable, and resilient to partition failures. As data continues to exhibit increasing heterogeneity,
solutions that address challenges in load balancing and data skewness become pivotal for the
seamless functioning of NoSQL databases in a distributed environment.

[12] In the realm of Big Data storage and analysis, Hadoop stands as a cornerstone tool for
researchers and scientists, utilizing the Hadoop Distributed File System (HDFS) for the storage of
massive datasets. HDFS employs a block placement policy to distribute large files into blocks
across a cluster in a distributed manner. Originally designed for homogeneous clusters, the advent
of heterogeneous nodes in modern networking has posed a challenge for Hadoop and HDFS,
calling for storage policies that can efficiently operate in both homogeneous and heterogeneous
environments. Traditionally, data locality in Hadoop aimed to map data blocks to processes on the
same node. However, the surge in Big Data has led to scenarios where it is essential to map data
blocks to processes across multiple nodes, introducing challenges such as performance degradation
due to I/O delays or network congestions, particularly on heterogeneous clusters. To address this,
the paper presents a novel algorithm designed to achieve more efficient load rearrangement among
nodes through custom block placement.


The proposed algorithm categorizes nodes into two types, such as homogeneous vs. heterogeneous
or high-performing vs. low-performing nodes, and balances data blocks accordingly. This custom
block placement policy enables better control over the distribution of data blocks, allowing
researchers to strategically place data where it is most beneficial for processing. By dividing total
nodes into categories based on performance characteristics, the algorithm optimizes data
placement, mitigating performance degradation and enhancing the overall efficiency of Hadoop in
both homogeneous and heterogeneous cluster environments. This research contributes to the
adaptability of Hadoop and HDFS in the face of evolving cluster architectures, ensuring their
continued efficacy in diverse and dynamic computing environments.

[13] In the last half-decade, the escalating demand for fault protection in large-scale storage
installations has driven extensive research and development efforts focused on erasure codes
tailored for scenarios involving multiple disk failures, surpassing the capabilities of traditional
RAID-5 configurations. This surge has given rise to numerous open-source implementations of
diverse coding techniques, offering a spectrum of choices to the wider community. In this paper,
a comprehensive head-to-head comparison of these implementations is conducted, specifically
evaluating their performance in encoding and decoding scenarios. The primary objectives are to
scrutinize various codes and implementations, ascertain the alignment between theoretical
expectations and practical outcomes, and emphasize the pivotal role of parameter selection,
particularly in relation to memory considerations, in influencing a code's performance.

Beyond the comparative analysis, the study aims to provide storage system designers with valuable
insights into the anticipated coding performance when conceptualizing and implementing storage
systems. By shedding light on the intricacies of different coding techniques and their real-world
implications, the research seeks to guide designers in making informed decisions that align with
their specific requirements. Additionally, the paper serves as a roadmap for identifying areas where
further research in erasure coding can yield the most substantial impact, highlighting avenues for
refinement and optimization.


Chapter 3

PROBLEM IDENTIFICATION

Many techniques have been introduced to provide fault tolerance using a Cauchy matrix in
combination with RS codes. Several existing approaches require long computation times to conduct
XOR operations when encoding and decoding user data. In existing work, erasure codes like
Cauchy RS use Galois Field (GF) arithmetic to accomplish matrix-vector multiplication, which causes
CPU overhead due to an exponential increase in read/write operations. An Optimized Cauchy Coding
method with RS coding is presented to overcome this problem. The proposed method makes it easier
to create Cauchy matrices because it reduces the number of XOR operations in data recovery.


Chapter 4

OBJECTIVES

i. The primary objective of the OCC approach is to enhance the efficiency of data coding
and decoding processes.

ii. To ensure that missing data from any data block can be recovered in the event of a disk
failure using the codeword.

iii. To reduce the time complexity of the encoding algorithm, making it more efficient.

iv. To ensure data integrity and availability even in the face of natural disasters, human
errors, or mechanical failures.

v. To demonstrate that OCC outperforms the other methods in terms of both encoding and
decoding speed.


Chapter 5

METHODOLOGY

5.1 Use Case Diagram

A use case diagram is a kind of behavioural diagram produced from a use-case analysis. Its purpose
is to present a graphical overview of the functionality provided by a system in terms of actors, their
goals (represented as use cases), and any dependencies between those use cases. A use case diagram
gives us information about how users and use cases are associated with the system. Use cases are
used during requirements elicitation and analysis to represent the functionality of the system, and
they concentrate on the behaviour of the system from an external point of view.

A use case describes a capability provided by the system that yields a visible result for an actor. An
actor describes any entity that interacts with the system. The actors are outside the boundary of the
system, while the use cases are inside it. Actors are represented with stick figures, use cases with
ovals, and the boundary of the system with a box enclosing the use cases.


Fig 5.1.1 Use case diagram end user

Fig 5.1.2 Use case diagram cloud server


5.2 Activity Diagram

Fig 5.2 Activity diagram

The diagram represents the activity flow for a process that involves uploading a file. The process
starts with the selection of a file. Once the file is selected, the data is encoded and divided into
chunks; this makes the upload process more efficient and ensures that the data is transmitted
correctly. The chunks are then uploaded to the server, where they are decoded and saved so that the
data is stored correctly and can be retrieved later if needed. Finally, the process concludes with the
collection of the chunks, which confirms that all the chunks are present and that the file is complete.
Once the chunks are collected, the file is considered uploaded and the process is complete.
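As a rough illustration of the "divide into chunks" step, the sketch below splits a file's bytes into k equal-length blocks, zero-padding the last one. The block count and padding policy are illustrative assumptions, not the report's fixed design.

```java
import java.util.Arrays;

// Sketch: split a file's bytes into k equal-length data blocks for encoding.
// Zero-padding the final block is an assumption made for illustration.
public final class ChunkSplitter {
    static byte[][] split(byte[] file, int k) {
        int blockLen = (file.length + k - 1) / k; // ceiling division
        byte[][] blocks = new byte[k][];
        for (int i = 0; i < k; i++) {
            int from = Math.min(i * blockLen, file.length);
            int to = Math.min(from + blockLen, file.length);
            // copy the slice, then pad with zeros up to the uniform block length
            blocks[i] = Arrays.copyOf(Arrays.copyOfRange(file, from, to), blockLen);
        }
        return blocks;
    }
}
```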


5.3 Data Flow Diagram

The DFD is a straightforward graphical formalism that can be used to represent a system in terms
of the input data to the system, the various processing carried out on this data, and the output data
produced by the system. A DFD model uses a very small number of primitive symbols to represent
the functions performed by a system and the data flow among those functions.

The main reason the DFD technique is so popular is probably that DFD is a very simple
formalism: it is easy to understand and to use. Starting with the set of high-level functions that a
system performs, a DFD model hierarchically represents its various sub-functions; indeed, any
hierarchical model is easy to understand.

The human mind can easily grasp a hierarchical model of a system because, starting with a very
simple and abstract model of the system, different details are gradually introduced through the
different levels. A data-flow diagram (DFD) is a graphical representation of the "flow" of
information through a data system. DFDs can likewise be used for the visualization of data
processing.

Data flow diagram of store procedure – level 0

Fig 5.3.1 Store procedure-level 0


5.3.1 Encoding Process:

1. Parameter Selection:
Code Size: The desired code size (N, K) is chosen, where N represents the total number of
codewords, and K denotes the number of data symbols. This choice influences the level of
redundancy and the ability to correct errors.
Finite Field: A finite field, such as GF(2^w), is selected for conducting mathematical operations.
This field serves as the foundation for calculations within the algorithm.
2. Cauchy Matrix Construction:
Matrix Size: An N x N Cauchy matrix is generated.
Element Calculation: Each element of the matrix is determined using the formula 1/(xi + yj), where
xi and yj are distinct elements from the chosen finite field. This matrix forms the basis for encoding
and decoding operations (see the sketch after these steps).

3. Data Arrangement:
Initial Matrix Rows: The original data symbols are placed as the first K rows of the N x N matrix.

4. Parity Row Generation:
Row Operations: The remaining N – K rows of the matrix are constructed through row operations,
primarily XOR operations. These rows serve as parity information, crucial for error correction.
Cauchy Matrix Rules: The specific operations to generate parity rows are derived from
mathematical properties of the Cauchy matrix.

5. Codeword Transmission:
Transmission: All N codewords, encompassing both data and parity information, are transmitted
over the communication channel.


Data flow diagram for download procedure – level 1

Fig 5.3.2 Download procedure-level 1

5.3.2 Decoding Process:

1. Error Identification:
Missing Codewords: The receiver identifies which codewords have been lost or corrupted during
transmission.

2. Sub-Matrix Construction:
Received Codewords: A sub-matrix is formed using the correctly received codewords.
Known Codewords: If any additional codewords are known a priori, they are also incorporated
into the sub-matrix.

3. Matrix Inversion:
Gaussian Elimination: Techniques such as Gaussian elimination are employed to calculate the
inverse of the sub-matrix (see the sketch after these steps).


4. Multiplication and Solving:
Missing Symbols: The inverse matrix is multiplied with a column vector containing zeros,
representing the missing symbols.
Solution Vector: The resulting vector contains values corresponding to the missing symbols.
Additional Decoding: If necessary, decoding algorithms like the Chien search are applied to refine
the values and ensure accuracy.

5. Data Recovery:
Substitution: The recovered symbols are substituted back into the original data matrix to
reconstruct the original data.
Simplified Calculations: The use of Cauchy matrices simplifies calculations compared to
Vandermonde-based RS codes, leading to potential computational efficiency.
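A minimal sketch of the matrix-inversion step (step 3) follows, assuming the illustrative GF256.mul and OccSketch.inv helpers from the earlier sketches. Gauss-Jordan elimination over GF(2^8) uses XOR for subtraction and multiplication by the field inverse for division; every square sub-matrix of a Cauchy matrix is invertible, which is what guarantees a pivot can always be found here.

```java
// Sketch: invert an n-by-n matrix over GF(2^8) by Gauss-Jordan elimination,
// as the decoding procedure above performs on the sub-matrix of surviving
// codewords. Assumes the matrix is invertible (true for Cauchy sub-matrices)
// and reuses the illustrative GF256.mul / OccSketch.inv helpers.
public final class GFGauss {
    static int[][] invert(int[][] a) {
        int n = a.length;
        int[][] aug = new int[n][2 * n];               // augmented [A | I]
        for (int i = 0; i < n; i++) {
            System.arraycopy(a[i], 0, aug[i], 0, n);
            aug[i][n + i] = 1;
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            while (aug[pivot][col] == 0) pivot++;      // nonzero pivot exists
            int[] tmp = aug[col]; aug[col] = aug[pivot]; aug[pivot] = tmp;
            int scale = OccSketch.inv(aug[col][col]);  // normalise the pivot row
            for (int j = 0; j < 2 * n; j++)
                aug[col][j] = GF256.mul(aug[col][j], scale);
            for (int r = 0; r < n; r++) {              // clear the column elsewhere
                if (r == col || aug[r][col] == 0) continue;
                int factor = aug[r][col];
                for (int j = 0; j < 2 * n; j++)
                    aug[r][j] ^= GF256.mul(factor, aug[col][j]); // XOR = subtract
            }
        }
        int[][] inv = new int[n][n];                   // right half is A^(-1)
        for (int i = 0; i < n; i++) System.arraycopy(aug[i], n, inv[i], 0, n);
        return inv;
    }
}
```

Multiplying the resulting inverse by the vector of available symbols then recovers the missing ones, completing steps 4 and 5.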

5.4 Sequence diagram

Fig 5.4 Sequence diagram


The provided sequence diagram illustrates a series of steps involved in a file uploading and
downloading process. The diagram starts with a user logging in and selecting a file to upload. The
file is then encoded and split into smaller chunks. The chunks are hashed and uploaded to a cloud
storage system. The user can then request the download of the file chunks and save them in their
cloud storage. Finally, the user can download the decoded file. Here's a detailed explanation of the
steps:

Login: The user logs in to the system.

Select File: The user selects a file to upload.

Encode & Split: The file is encoded and split into smaller chunks. This could involve using
techniques like FEC (Forward Error Correction) to distribute the file across multiple chunks,
ensuring that the file can be reconstructed even if some chunks are lost or corrupted.

Generate Chunks: The file is divided into smaller chunks, which can be used for more efficient
storage and transmission.

Hash: Each chunk is hashed, which helps ensure the integrity of the data and allows for error
detection in case of corruption (see the hashing sketch after these steps).

Upload Chunk Details: The details of each chunk, including its hash, are uploaded to the cloud
storage system.

Download Request: The user requests the download of the file chunks.

Download Chunks: The requested chunks are downloaded from the cloud storage system.

User Cloud Save Chunks: The user saves the downloaded chunks in their cloud storage.

File Decoding: The user decodes the saved chunks, reconstructing the original file.
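The hashing step above can be realized with the JDK's standard MessageDigest API; the sketch below computes a SHA-256 digest per chunk so integrity can be re-checked on download. The choice of SHA-256 and the class name are illustrative assumptions, not details fixed by the report.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: per-chunk SHA-256 digest for the "Hash" step. Comparing the stored
// hex digest against a recomputed one on download detects corrupted chunks.
public final class ChunkHasher {
    static String sha256Hex(byte[] chunk) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(chunk);
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```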


Chapter 6

SYSTEM REQUIREMENTS AND SPECIFICATION

6.1 Hardware Requirements:

i. Processor: i5

ii. RAM: 8 GB

iii. Hard Disk: 400 GB

6.2 Software Requirements:

i. Language: JAVA

ii. Operating System: Windows 10/11

iii. Tool: NetBeans IDE

iv. Database: MySQL

v. Cloud: AWS


Chapter 7

EXPECTED OUTCOME OF THE PROPOSED PROJECT

The OCC data recovery strategy outperforms the OWSPM-MSR and PM-MSR data recovery
strategies. Because PM-MSR employs the Vandermonde matrix in the encoding process, its
encoding time is longer than that of the OCC technique, which uses the Cauchy matrix in
conjunction with the RS code. In the encoding process, the OCC technique employs a Cauchy
good matrix with fewer ones, which decreases the time required for the XOR operations and
ultimately optimizes the overall processing time for encoding and decoding. As a result, the OCC
technique outperforms the OWSPM-MSR and PM-MSR strategies.


REFERENCES

i. Snehalata Funde and Gandharba Swain, "Data Recovery Approach with Optimized
Cauchy Coding in Distributed Storage System," International Journal of Advanced
Computer Science and Applications (IJACSA), vol. 13, no. 6, 2022.
http://dx.doi.org/10.14569/IJACSA.2022.0130675

ii. J. Bian, S. Luo, Z. Li and Y. Yang, "Optimal Weakly Secure Minimum Storage
Regenerating Codes Scheme," IEEE Access, vol. 7, pp. 151120-151130, 2019.
doi: 10.1109/ACCESS.2019.2947248

iii. X. Wang, Z. Zhang, J. Li et al., "An Optimized Encoding Algorithm for Systematic
Polar Codes," EURASIP Journal on Wireless Communications and Networking,
vol. 2019, article 193, 2019. https://doi.org/10.1186/s13638-019-1491-4

iv. J.-J. Kim, "Erasure-Coding-Based Storage and Recovery for Distributed Exascale
Storage Systems," Applied Sciences, vol. 11, article 3298, 2021.
https://doi.org/10.3390/app11083298

v. Y. J. Tang and X. Zhang, "Fast En/Decoding of Reed-Solomon Codes for Failure
Recovery," IEEE Transactions on Computers, vol. 71, no. 3, pp. 724-735, March 2022.
doi: 10.1109/TC.2021.3060701
