Apache Cassandra Report
SEMINAR REPORT ON
APACHE CASSANDRA
Submitted in partial fulfilment of 5th Semester
MASTER OF COMPUTER APPLICATIONS
of
Visvesvaraya Technological University
CHETHAN GOWDA
1AY18MCA62
Under the Guidance of
Prof. ANITHA K. L.
CERTIFICATE
This is to certify that the seminar entitled
APACHE CASSANDRA
Submitted in the partial fulfilment of requirement of the 5th semester
of
Master of Computer Applications
is a result of the bonafide work carried out by
CHETHAN GOWDA
1AY18MCA62
During the academic year 2019-2020
TABLE OF CONTENTS
Sl No  Contents
1  Introduction
2  History of Cassandra
3  NoSQL database
4  Cassandra architecture
5  Features of Cassandra
8  Working
9  Conclusion
11  References
ABSTRACT
Biometric ATM using Iris recognition discusses the use of iris-based biometric recognition.
Biometric recognition is the automated recognition of individuals based on their physiological and
behavioural characteristics. The recognition can be positive or negative. The report highlights the key areas where
the iris biometric method has been used successfully, and where it falls short. It presents an overview of the
algorithm used in the iris biometric method and compares it with other biometric methods in terms of cost-effectiveness,
usability, speed and other factors. The iris is distinctive in that it has many features such as crypts, furrows
and collarettes, which the algorithms use to compare a stored template with an image acquired
for recognition.
Most of the algorithms used for iris recognition have a very low false acceptance rate compared to other
biometric methods, and these algorithms can do millions of comparisons on easily available hardware.
1. INTRODUCTION
Cassandra is a distributed storage system for managing structured data that is designed to
scale to a very large size across many commodity servers, with no single point of failure.
It is meant to run on top of an infrastructure of hundreds of nodes, where small and
large components in the data centers fail continuously. Despite these failures, Cassandra aims for
scalability, high performance, high availability and applicability. It does not support a full
relational data model. Instead, it provides clients with a simple data model, as explained later.
Many modern businesses have outgrown the typical RDBMS use case and need
data management software that offers more. Sharding was a stop-gap measure, but
its architectural limitations, and the management complexity it requires, make it unacceptable for
many mainstream organizations.
Figure1-Cassandra logo
Apache Cassandra
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its
distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at
Facebook, it is now used at some of the most popular sites on the Web.” This definition packs in
many technical terms: distributed, decentralized, elastically scalable, highly available,
fault-tolerant, tuneably consistent and column-oriented.
2. HISTORY OF CASSANDRA
Amazon Dynamo
A highly available and scalable storage system used by Amazon to store and retrieve
user shopping carts and to support other core services.
How does it work?
It allows read and write operations to continue even during network partitions and
resolves update conflicts using different conflict resolution mechanisms.
It can be customized to suit each application’s requirements.
It relies on consistent hashing and vector clocks.
Google Bigtable
A high-performance data storage system built on the Google File System and other Google
technologies.
How does it work?
It provides both structure and data distribution but relies on a distributed file system for
durability.
It has a richer data model than Dynamo: one key, many values, with fast sequential access.
It uses SSTable storage, mem-tables, compaction and append-only writes.
3. NOSQL DATABASE
A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational
databases. The primary objectives of a NoSQL database are:
simplicity of design,
horizontal scaling, and
finer control over availability.
The following table lists the points that differentiate a relational database from a NoSQL
database.
Relational Database                      NoSQL Database
Relational databases are used to         NoSQL databases can handle big data or
handle a moderate volume of data.        data in a very high volume.
4. CASSANDRA ARCHITECTURE
Cassandra can satisfy many data-driven application use cases through a carefully
thought-out architecture designed to manage all forms of modern data, scale to meet the
requirements of “big data” management, offer linear performance scale-out capabilities, and
deliver the type of high availability that almost every online, 24x7 application needs. At its
foundation, Cassandra is a peer-to-peer distributed data management system where every node
is essentially the same with respect to how it functions in the cluster. In Cassandra, there is no
concept of a “master node” or anything similar, so no single
point of failure exists for any key process or function.
Figure2-Cassandra table
5. FEATURES OF CASSANDRA
Several technical features account for Cassandra’s popularity.
High Scalability
Cassandra is highly scalable: you can add more hardware to accommodate more
customers and more data as required.
Robust Architecture
Cassandra has no single point of failure and is continuously available for
business-critical applications that cannot afford a failure.
Fast Linear-scale Performance
Cassandra is linearly scalable. Throughput increases as you increase the number of
nodes in the cluster, so it maintains a quick response time as load grows.
Fault tolerant
Cassandra is fault tolerant. Suppose there are 4 nodes in a cluster and each node has
a copy of the same data. If one node stops serving, the other three nodes can still serve
requests.
Flexible Data Storage
Cassandra accommodates structured, semi-structured and unstructured data.
It lets you change your data structures as your needs
evolve.
Easy Data Distribution
Data distribution in Cassandra is very easy because it provides the flexibility to
distribute data where you need by replicating data across multiple data centers.
Transaction Support
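Cassandra does not provide full ACID (Atomicity, Consistency, Isolation, Durability)
transactions in the relational sense. Writes are atomic and isolated at the partition
level and durable through the commit log, while consistency is tunable per operation.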
Fast writes
Cassandra was designed to run on cheap commodity hardware. It performs very
fast writes and can store hundreds of terabytes of data without sacrificing read
efficiency.
In Cassandra, nodes in a cluster act as replicas for a given piece of data. If some of the
nodes respond with an out-of-date value, Cassandra returns the most recent value to
the client. After returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
Figure3-Cassandra Replication
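The read-repair behaviour described above can be illustrated with a small, self-contained sketch. It is a toy in-memory model and not Cassandra's implementation: the Replica class, the read_with_repair function and the integer timestamps are hypothetical names chosen for illustration, and the repair runs inline rather than in the background.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A stored cell is (value, write_timestamp); the larger timestamp wins.
Cell = Tuple[str, int]


@dataclass
class Replica:
    """Toy stand-in for one node holding a copy of the data (hypothetical)."""
    name: str
    store: Dict[str, Cell] = field(default_factory=dict)


def read_with_repair(replicas: List[Replica], key: str) -> str:
    """Return the most recent value for key and refresh stale replicas."""
    # Collect the answer of every replica that holds the key.
    answers = [r.store[key] for r in replicas if key in r.store]
    if not answers:
        raise KeyError(key)
    # The cell with the highest timestamp is treated as the latest.
    newest = max(answers, key=lambda cell: cell[1])
    # Read repair: overwrite any replica that is missing the key or behind.
    # (Done inline here; the text describes it as a background step.)
    for r in replicas:
        if r.store.get(key) != newest:
            r.store[key] = newest
    return newest[0]


replicas = [
    Replica("node1", {"user:1": ("alice@old.example", 10)}),
    Replica("node2", {"user:1": ("alice@new.example", 20)}),
    Replica("node3", {"user:1": ("alice@old.example", 10)}),
]
latest = read_with_repair(replicas, "user:1")
print(latest)
```

After the call, all three toy replicas hold the value written at timestamp 20, which is the effect the paragraph attributes to read repair.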
Components of Cassandra
Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write
operation is written to the commit log.
Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is
written to the mem-table. Sometimes, for a single column family, there will be multiple
mem-tables.
SSTable: It is a disk file to which the data is flushed from the mem-table when its contents
reach a threshold value.
Bloom filter: A bloom filter is a quick, probabilistic structure for testing whether an
element is a member of a set. It can report false positives but never false negatives, and it
acts as a special kind of cache. Bloom filters are consulted on every
query.
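To make the bloom filter entry above concrete, here is a minimal sketch of the data structure. The class name, the bit-array size, the number of hash functions and the use of salted SHA-256 are all assumptions made for illustration; Cassandra's own filter is implemented differently.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per SSTable lets a read skip tables that cannot hold the key.
sstable_filter = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    sstable_filter.add(row_key)

print(sstable_filter.might_contain("user:2"))  # True: added keys always match
```

A key that was never added usually yields False, so the corresponding disk file need not be opened; an occasional false positive only costs one unnecessary lookup.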
Cassandra Query Language (CQL) is used to access Cassandra through its nodes. CQL
treats the database (Keyspace) as a container of tables. Programmers work with CQL either
through cqlsh, an interactive prompt, or through separate application language drivers.
The client can approach any of the nodes for its read-write operations. That node
(the coordinator) acts as a proxy between the client and the nodes holding the data.
cqlsh
This command is used to start the cqlsh prompt. In addition, it supports a few
options. The following table explains these options and their usage.
Options                          Usage
cqlsh --help                     Shows help topics about the options of cqlsh commands.
cqlsh --version                  Provides the version of the cqlsh you are using.
cqlsh --execute cql_statement    Directs the shell to accept and execute a CQL command.
cqlsh --file="file name"         Executes the commands in the given file and exits.
cqlsh -u "user name"             Authenticates a user. The default user name is: cassandra.
cqlsh -p "password"              Authenticates a user with a password. The default
                                 password is: cassandra.
Inside the prompt, cqlsh also accepts shell commands such as:
CONSISTENCY − Shows the current consistency level, or sets a new consistency level.
SHOW − Displays the details of the current cqlsh session such as Cassandra version, host, or
data type assumptions.
CQL Clauses
SELECT − This clause reads data from a table.
WHERE − The where clause is used along with select to read specific data.
ORDER BY − The order by clause is used along with select to read specific data in a
specific order.
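As an illustration of the three clauses above, the sketch below holds a few CQL statements as plain Python strings. The keyspace demo, the table readings and its columns are hypothetical, and nothing here talks to a cluster: running the statements would need a live Cassandra node and a client such as cqlsh or a language driver, which is not shown.

```python
# CQL statements held as plain Python strings (hypothetical schema).

CREATE_KEYSPACE = (
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
)

# sensor_id is the partition key; taken_at is a clustering column.
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS demo.readings ("
    "sensor_id int, taken_at timestamp, temperature double, "
    "PRIMARY KEY (sensor_id, taken_at));"
)

# SELECT reads data from a table.
SELECT_ALL = "SELECT * FROM demo.readings;"

# WHERE narrows the read, here to a single partition key.
SELECT_ONE_SENSOR = "SELECT * FROM demo.readings WHERE sensor_id = 42;"

# ORDER BY sorts the rows of the selected partition by a clustering column.
SELECT_NEWEST_FIRST = (
    "SELECT * FROM demo.readings WHERE sensor_id = 42 "
    "ORDER BY taken_at DESC;"
)

STATEMENTS = [CREATE_KEYSPACE, CREATE_TABLE, SELECT_ALL,
              SELECT_ONE_SENSOR, SELECT_NEWEST_FIRST]

for statement in STATEMENTS:
    print(statement)
```

Note that in CQL, ORDER BY applies to clustering columns and expects the partition key to be restricted in the WHERE clause, which is why the last query filters on sensor_id first.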
8. WORKING
Read Operations
In a read operation, Cassandra gets values from the mem-table and checks the bloom
filter to find the appropriate SSTable that contains the required data.
A coordinator sends three types of read request to replicas:
Direct request
Digest request
Read repair request
The coordinator sends a direct request to one of the replicas. After that, the coordinator
sends digest requests to the number of replicas specified by the consistency level and
checks whether the returned data is up to date.
After that, the coordinator sends digest requests to all the remaining replicas. If any
node returns an out-of-date value, a background read repair request updates that data. This
process is called the read repair mechanism.
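The sequence of direct, digest and read repair requests can be sketched as follows. This is a simplified model under stated assumptions: replicas are plain dictionaries, every replica already holds the key, the digest is a SHA-256 fingerprint, and the function name coordinator_read is hypothetical. It is meant to show the order of the steps, not Cassandra's actual code.

```python
import hashlib
from typing import Dict, List, Tuple

Cell = Tuple[str, int]  # (value, write_timestamp)


def digest(cell: Cell) -> str:
    """Short fingerprint of a cell, cheaper to ship than the full data."""
    return hashlib.sha256(repr(cell).encode("utf-8")).hexdigest()


def coordinator_read(replicas: Dict[str, Dict[str, Cell]], key: str,
                     consistency: int) -> Tuple[str, List[str]]:
    """Return (value, repaired_nodes); assumes every replica holds key."""
    names = list(replicas)
    contacted = names[:consistency]

    # Direct request: full data from the first contacted replica.
    data = replicas[contacted[0]][key]

    # Digest requests: only fingerprints from the other contacted replicas.
    digests = [digest(replicas[n][key]) for n in contacted[1:]]

    if any(d != digest(data) for d in digests):
        # Mismatch: compare full cells and keep the newest by timestamp.
        data = max((replicas[n][key] for n in contacted), key=lambda c: c[1])
    answer = data[0]

    # Read repair (inline here, a background step in the text): look at all
    # replicas and bring every out-of-date copy up to the newest cell.
    newest = max((replicas[n][key] for n in names), key=lambda c: c[1])
    repaired = [n for n in names if replicas[n][key][1] < newest[1]]
    for n in repaired:
        replicas[n][key] = newest
    return answer, repaired


cluster = {
    "node1": {"k": ("v2", 2)},
    "node2": {"k": ("v1", 1)},  # stale copy
    "node3": {"k": ("v2", 2)},
}
value, repaired = coordinator_read(cluster, "k", consistency=2)
print(value, repaired)
```

With a consistency of 2, the coordinator in this sketch answers from node1 and node2, notices the digest mismatch, and then repairs node2.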
Write Operations
Every write on a node is first recorded in that node’s commit log. The data is then
stored in the mem-table. Whenever the mem-table is full, its data
is written to an SSTable data file.
All writes are automatically partitioned and replicated throughout the cluster. Cassandra
periodically consolidates the SSTables (compaction), discarding unnecessary data.
Figure5-Write Operations
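The write path described above (commit log, then mem-table, then SSTable, followed by periodic consolidation) can be modelled in a few lines. The ToyNode class, its method names and the tiny mem-table limit are hypothetical and chosen only for illustration; real SSTables are on-disk files, and details such as discarding commit log segments after a flush are not modelled.

```python
from typing import Dict, List, Optional, Tuple


class ToyNode:
    """In-memory model of one node's write path (hypothetical names)."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # append-only record
        self.memtable: Dict[str, str] = {}           # recent writes in memory
        self.sstables: List[Dict[str, str]] = []     # flushed tables
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. log first, for recovery
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the limit is hit

    def flush(self) -> None:
        # Write the mem-table out as a new sorted table and start afresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self) -> None:
        # Merge all tables; later tables override earlier ones per key.
        merged: Dict[str, str] = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            if key in table:
                return table[key]
        return None


node = ToyNode(memtable_limit=2)
for k, v in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]:
    node.write(k, v)

tables_before = len(node.sstables)
node.compact()
print(tables_before, len(node.sstables), node.read("a"))
```

In this run the four writes produce two flushed tables, and compaction merges them into one while dropping the overwritten value of key a.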
9. CONCLUSION
Apache Cassandra is well suited to large-scale applications that need to access huge
volumes of unstructured data. That said, Cassandra is still a good choice for smaller
applications, as it delivers a high level of data protection out of the box.
Developing for Cassandra is relatively simple, as most of the cleverer aspects of the
technology are handled transparently, so developers rarely need to write
platform-specific code. This makes Cassandra easy to adopt, as developers need
little ramp-up before they start creating applications.
Six ways in which Cassandra delivers a powerful foundation for future multi-cloud deployments:
Topology-aware availability
Tunable consistency
Remote regional awareness
Flexible global and local consistency
Simple and effective replication
Open source licensing with needed flexibility
Some technical problems remain to be fixed as well. These issues are native to Cassandra and
cannot be tolerated in situations where performance predictability is critical.
Even with these shortcomings, Cassandra has every chance of becoming one of the most
widely adopted NoSQL solutions and a standard for scalable, highly available storage.
11. REFERENCES
1. https://en.wikipedia.org/wiki/Apache_Cassandra
2. http://cassandra.apache.org/
3. https://www.tutorialspoint.com/cassandra/cassandra_introduction.htm
4. https://www.datastax.com/
5. https://www.javatpoint.com/cassandra-tutorial