DadBig Data Hadoop Interview Questions and Answers

1. Compare MongoDB and Cassandra

Criteria              MongoDB         Cassandra
Data Model            Document        Big Table like
Database scalability  Read            Write
Querying of data      Multi-indexed   Using Key or Scan
2. What is Cassandra?
Cassandra is one of the most favored NoSQL distributed database management
systems, maintained by Apache. It is open source and designed to store and manage
large volumes of data with no single point of failure. Originally designed at
Facebook, Apache Cassandra is written in Java, scales well for Big Data workloads
and supports flexible schemas. Among the various types of NoSQL databases,
Cassandra is a hybrid of a column-oriented and a key-value store. The keyspace is
the outermost container for an application, and a table (or column family) is an
entity within a keyspace.
3. List the benefits of using Cassandra.
Unlike traditional databases, Apache Cassandra delivers near real-time
performance, simplifying the work of Developers, Administrators, Data Analysts and
Software Engineers.
• Instead of a master-slave architecture, Cassandra is built on a peer-to-peer
architecture, so there is no single point of failure.
• It is flexible: nodes can be added to any Cassandra cluster in any datacenter,
and any client can send its request to any server.
• Cassandra can be scaled up and scaled down as requirements change. It sustains a
high throughput for read and write operations and need not be restarted while
scaling.
• Cassandra replicates data across nodes, so data stored at multiple locations can
still be retrieved from another location if one node fails. Users can set the
number of replicas they want to create.
• It performs well on massive datasets, which makes it a preferred NoSQL DB for
many organizations.
• It operates on a column-oriented structure, which speeds up and simplifies
slicing. Data access and retrieval also become more efficient with a column-based
data model.
• Apache Cassandra supports a schema-free/schema-optional data model, so you need
not define up front all the columns required by your application.
4. Explain the concept of Tunable Consistency in Cassandra.

Tunable Consistency is a characteristic that makes Cassandra a favored database
choice of Developers, Analysts and Big Data Architects. Consistency refers to how up
to date and synchronized a row of data is on all of its replicas. Cassandra’s Tunable
Consistency lets users select the consistency level best suited to their use case.
It supports two kinds of consistency: Eventual Consistency and Strong Consistency.
Eventual consistency guarantees that, when no new updates are made to a given data
item, all accesses eventually return the last updated value. Systems with eventual
consistency are said to achieve replica convergence.
For Strong consistency, Cassandra supports the following condition:
R + W > N, where
N – Number of replicas
W – Number of nodes that need to agree for a successful write
R – Number of nodes that need to agree for a successful read
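The condition above can be checked with a few lines of Python. This is only a minimal sketch: the function name is_strongly_consistent is made up for illustration and is not part of any Cassandra API.

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """Return True when every read set must overlap every write set (R + W > N).

    n: number of replicas
    w: replicas that must acknowledge a write
    r: replicas that must answer a read
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    return r + w > n


# With three replicas, quorum writes plus quorum reads overlap on at least one node.
print(is_strongly_consistent(n=3, w=2, r=2))
# Writing to one replica and reading from one replica gives no such guarantee.
print(is_strongly_consistent(n=3, w=1, r=1))
```

With N = 3, choosing W = 3 and R = 1 also satisfies the condition, at the cost of writes failing whenever a single replica is down.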
5. How does Cassandra write?
Cassandra performs a write in two steps: first it appends the write to a commit
log on disk, and then it writes to an in-memory structure known as the memtable.
Once both steps succeed, the write is acknowledged. When the memtable is flushed,
the data is written to disk as an SSTable (sorted string table). Because the only
disk work at write time is a sequential append, Cassandra offers fast write
performance.
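The write path can be pictured with a toy model. The sketch below is an illustration under simplifying assumptions, not Cassandra's implementation: the class ToyNode and its memtable_limit parameter are hypothetical, the commit log is a Python list instead of a file, and a flush is triggered by entry count rather than memory size.

```python
class ToyNode:
    """Toy model of the write path: commit log, then memtable, then SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []      # stands in for the append-only log on disk
        self.memtable = {}        # in-memory structure, latest value per key
        self.sstables = []        # immutable, key-sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # step 1: durable append
        self.memtable[key] = value             # step 2: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable.clear()
        # Flushed data no longer needs the log for recovery.
        self.commit_log.clear()


node = ToyNode(memtable_limit=2)
node.write("b", 1)
node.write("a", 2)   # memtable is full, so it is flushed as one sorted SSTable
node.write("c", 3)   # lands in a fresh memtable
print(node.sstables)
print(node.memtable)
```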
6. Define the management tools in Cassandra.
• DataStax OpsCenter: a web-based management and monitoring solution for
Cassandra clusters and DataStax. It is free to download, and an additional
edition of OpsCenter is also offered.
• SPM: primarily monitors Cassandra metrics and various OS and JVM metrics.
Besides Cassandra, SPM also monitors Hadoop, Spark, Solr, Storm, ZooKeeper
and other Big Data platforms. The main features of SPM include correlation of
events and metrics, distributed transaction tracing, real-time graphs with
zooming, anomaly detection and heartbeat alerting.
7. Define memtable.
A memtable is an in-memory, write-back cache that holds content in key and column
format. The data in a memtable is sorted by key, and each ColumnFamily has its own
memtable, from which column data is retrieved by key. It stores writes until it is
full and is then flushed to disk.
8. What is SSTable? How is it different from other relational tables?
SSTable expands to ‘Sorted String Table’. It is an important data file in
Cassandra that receives the contents of memtables as they are regularly flushed.
SSTables are stored on disk and exist for each Cassandra table. Unlike a relational
table, which is updated in place, an SSTable is immutable: once it is written, no
data items can be added or removed. For each SSTable, Cassandra also creates three
separate structures: a partition index, a partition summary and a bloom filter.
9. Explain the concept of Bloom Filter.
Associated with each SSTable, a Bloom filter is an off-heap data structure (kept in
native memory rather than on the Java heap) used to check whether the SSTable may
contain the requested data before performing any disk I/O. It can return false
positives but never false negatives, so a negative answer lets Cassandra skip the
SSTable entirely.
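A Bloom filter is small enough to sketch in full. The version below is a generic textbook construction, not Cassandra's implementation: the class name, the default sizes and the use of SHA-256 to derive bit positions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("row-1")
print(sstable_filter.might_contain("row-1"))
```

A read would consult might_contain first and touch the SSTable on disk only when the answer is True.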
10. Explain CAP Theorem.
When systems must scale as additional resources are needed, the CAP Theorem plays
a major role in choosing the scaling strategy. The Consistency, Availability and
Partition tolerance (CAP) theorem states that a distributed system like Cassandra
can provide only two of these three characteristics at once; one of them has to be
sacrificed. Consistency guarantees that the client gets the most recent write,
Availability means every request gets a reasonable response within a minimum time,
and Partition Tolerance means the system continues operating when network
partitions occur. Since partitions cannot be ruled out in a distributed system, the
two options available are AP and CP.
11. State the differences between a node, a cluster and datacenter in Cassandra.
Cassandra has several components. A node is a single machine running Cassandra. A
cluster is a collection of nodes that hold similar types of data grouped together.
Data centers are useful when serving customers in different geographical areas:
you can group different nodes of a cluster into different data centers.
12. How to write a query in Cassandra?
Using CQL (Cassandra Query Language). Cqlsh is used for interacting with the database.
13. Which operating systems does Cassandra support?
Windows and Linux
14. What is Cassandra Data Model?
Cassandra Data Model consists of four main components:
Cluster: made up of multiple nodes and keyspaces
Keyspace: a namespace that groups multiple column families, typically one per application
Column: consists of a column name, value and timestamp
ColumnFamily: multiple columns referenced by a row key
15. What is CQL?
CQL is the Cassandra Query Language, used to access and query the Apache Cassandra
distributed database. It consists of a CQL parser that leaves all the implementation
details to the server. The syntax of CQL is similar to SQL, but it does not alter
the Cassandra data model.
16. Explain the concept of compaction in Cassandra.
Compaction is a maintenance process in Cassandra in which SSTables are reorganized
to optimize the data structures on disk. It complements the memtable: every flush
creates a new SSTable, and compaction merges them. There are two types of
compaction in Cassandra:
Minor compaction: started automatically when a new SSTable is created. Here,
Cassandra merges similarly sized SSTables into one.
Major compaction: triggered manually using nodetool. It compacts all SSTables of a
ColumnFamily into one.
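The merge at the heart of compaction can be sketched as follows. This is a simplified illustration, assuming each SSTable is a list of (key, value, timestamp) tuples and that the newest timestamp wins; the function name compact is hypothetical and real compaction strategies are considerably more involved.

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the newest value for each key.

    Each SSTable is modelled as a list of (key, value, timestamp) tuples.
    """
    newest = {}
    for table in sstables:
        for key, value, ts in table:
            # Keep the entry with the highest timestamp for each key.
            if key not in newest or ts > newest[key][1]:
                newest[key] = (value, ts)
    # The result is again sorted by key, as an SSTable must be.
    return [(key, value, ts) for key, (value, ts) in sorted(newest.items())]


older = [("a", "1", 10), ("b", "2", 10)]
newer = [("a", "9", 20), ("c", "3", 20)]
print(compact([older, newer]))
```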
17. Does Cassandra support ACID transactions?
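Unlike relational databases, Cassandra does not support full ACID transactions.
Writes are atomic, isolated and durable at the level of a single partition, and
consistency is tunable rather than enforced across multiple rows or tables.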
18. Explain Cqlsh
Cqlsh expands to Cassandra Query Language Shell, the interactive terminal for CQL.
It is a Python-based command-line prompt used on Linux or Windows to execute CQL
commands, along with shell commands like ASSUME, CAPTURE, CONSISTENCY, COPY,
DESCRIBE and many others. With cqlsh, users can define a schema, insert data and
execute a query.
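As a rough illustration of those three tasks, the Python snippet below only holds a few CQL statements as strings and prints them, so they can be pasted into cqlsh. The keyspace demo, the table users and its columns are hypothetical names, and the replication settings are an assumption suitable for a single-node test setup.

```python
# Hypothetical keyspace, table and column names, chosen only for illustration.
CQL_STATEMENTS = [
    # Define a schema: a keyspace and a table inside it.
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1};",
    "CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text);",
    # Insert data.
    "INSERT INTO demo.users (user_id, name) VALUES (1, 'Ada');",
    # Execute a query by primary key.
    "SELECT name FROM demo.users WHERE user_id = 1;",
]

for statement in CQL_STATEMENTS:
    print(statement)
```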
19. What is SuperColumn in Cassandra?
A Cassandra Super Column is an element that groups similar collections of data.
Super columns are key-value pairs whose values are columns. A super column is a
sorted array of columns, and the hierarchy is: keyspace > column family > super
column > column, a data structure often pictured as nested JSON.
Similar to row keys, super column entries contain no independent values but are
used to collect other columns. Note that super column keys appearing in different
rows do not necessarily match.
20. Define the consistency levels in Cassandra.
The levels below are described in terms of writes; for reads, the same names
indicate how many replicas must respond before a result is returned.
• ALL: Highly consistent. A write must be written to commitlog and memtable on
all replica nodes in the cluster.
• EACH_QUORUM: A write must be written to commitlog and memtable on a
quorum of replica nodes in all data centers.
• QUORUM: A write must be written to commitlog and memtable on a quorum of
replica nodes across all data centers.
• LOCAL_QUORUM: A write must be written to commitlog and memtable on a
quorum of replica nodes in the same data center.
• ONE: A write must be written to commitlog and memtable of at least one replica
node.
• TWO, THREE: Same as ONE but at least two and three replica nodes,
respectively.
• LOCAL_ONE: A write must be written to at least one replica node in the local
data center.
• ANY: A write must be written to at least one node; a stored hint counts. This
level applies to writes only.
• SERIAL: Linearizable Consistency to prevent unconditional update.
• LOCAL_SERIAL: Same as SERIAL but restricted to the local data center.
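The replica counts behind these levels follow from the replication factor. The sketch below assumes the usual definition of a quorum, floor(RF / 2) + 1; the function names and the rf_by_dc dictionary (replication factor per data center) are hypothetical and only illustrate the arithmetic.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_replicas(level: str, rf_by_dc: dict, local_dc: str) -> int:
    """Number of replica acknowledgements needed for a few consistency levels."""
    total_rf = sum(rf_by_dc.values())
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "TWO":
        return 2
    if level == "THREE":
        return 3
    if level == "QUORUM":
        return quorum(total_rf)
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return total_rf
    raise ValueError(f"level not covered by this sketch: {level}")


rf = {"dc1": 3, "dc2": 3}
for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"):
    print(level, required_replicas(level, rf, local_dc="dc1"))
```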
21. What is difference between Column and Super Column?
Both elements are tuples with a name and a value. However, a Column's value is a
string, while a Super Column's value is a Map of Columns with different data types.
Unlike Columns, Super Columns do not contain the third component, the timestamp.
22. What is ColumnFamily?
As the name suggests, a ColumnFamily is a structure that can hold an unlimited
number of rows. Each row is referenced by a key and holds key-value pairs, where
the key is the name of the column and the value is the column data. It is much like
a HashMap in Java or a dictionary in Python. Remember, the rows are not limited to
a predefined list of columns. The ColumnFamily is flexible: one row may have 100
columns while another has only 2.
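The dictionary analogy can be made concrete. In this sketch a ColumnFamily is modelled as a plain Python dict of dicts; the row keys and column names are invented for illustration, and timestamps are left out for brevity.

```python
# Row key -> {column name -> column value}; rows need not share the same columns.
column_family = {
    "user:1": {"name": "Ada", "email": "ada@example.com", "city": "London"},
    "user:2": {"name": "Linus"},
}

# Adding a column to one row does not affect any other row.
column_family["user:2"]["editor"] = "vim"

for row_key, columns in column_family.items():
    print(row_key, sorted(columns))
```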
23. Define the use of Source Command in Cassandra.
Source command is used to execute a file consisting of CQL statements.
24. What is Thrift?
Thrift is a legacy RPC protocol and API, combined with a code generation tool,
that has since been superseded by CQL. The purpose of using Thrift in Cassandra is
to provide access to the DB across programming languages.
25. Explain Tombstone in Cassandra.
A tombstone is a marker in a row indicating that a column has been deleted. The
marked columns are removed during compaction. Tombstones matter because Cassandra
supports eventual consistency: the deletion has to reach every replica before the
data is physically removed, otherwise a replica that missed the delete could bring
the data back.
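A minimal sketch of the idea, assuming each cell is stored as a (value, timestamp) pair. The helper names, the TOMBSTONE sentinel and the gc_grace parameter (a grace period before tombstones may be dropped) are simplifications made up for this example.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def delete(cells, key, ts):
    """A delete does not remove data; it writes a tombstone with a timestamp."""
    cells[key] = (TOMBSTONE, ts)


def read(cells, key):
    """Reads treat a tombstoned cell as if it did not exist."""
    value, _ = cells.get(key, (TOMBSTONE, None))
    return None if value is TOMBSTONE else value


def compact(cells, now, gc_grace):
    """Drop tombstones only once they are at least gc_grace old."""
    return {
        key: (value, ts)
        for key, (value, ts) in cells.items()
        if not (value is TOMBSTONE and now - ts >= gc_grace)
    }


cells = {"a": ("1", 100), "b": ("2", 100)}
delete(cells, "a", ts=200)
print(read(cells, "a"))                                   # the deleted cell reads as absent
print(sorted(compact(cells, now=250, gc_grace=100)))      # tombstone still kept
print(sorted(compact(cells, now=300, gc_grace=100)))      # tombstone purged
```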
26. Which platforms does Cassandra run on?
Since Cassandra is a Java application, it can run on any platform that provides a
Java Runtime Environment (JRE) or Java Virtual Machine (JVM). Cassandra also runs
on RedHat, CentOS, Debian and Ubuntu Linux platforms.
27. Name the ports Cassandra uses.
By default, Cassandra uses port 7000 for cluster management, 9160 for Thrift
clients and 8080 for JMX (newer releases use 7199 for JMX and 9042 for CQL native
clients). These are all TCP ports and can be edited in the configuration file:
bin/cassandra.in.sh
28. Can you add or remove Column Families in a working Cluster?
Yes, but keep the following steps in mind:
• Clear the commitlog with ‘nodetool drain’.
• Shut down Cassandra and check that no data is left in the commitlog.
• Delete the sstable files for the removed CFs.
29. What is Replication Factor in Cassandra?

ReplicationFactor is the number of copies of each piece of data kept in the
cluster. Increasing it can also matter for logging into the cluster: if
authentication data has only one replica, logins fail whenever the node holding it
is down.
30. Can we change Replication Factor on a live cluster?
Yes, but it will require running repair to alter the replica count of existing data.
31. How to iterate all rows in ColumnFamily?
Using get_range_slices. Start the iteration with the empty string as the start
key; after each iteration, the last key read serves as the start key for the next
iteration.
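The paging loop can be sketched without a running cluster. Here get_range_slices is a hypothetical in-memory stand-in, not the Thrift call itself: it is assumed to return rows in sorted key order with an inclusive start key, which is why every page after the first skips its first row.

```python
def get_range_slices(rows, start_key, count):
    """Hypothetical stand-in: up to `count` rows with key >= start_key, in key order."""
    keys = sorted(k for k in rows if k >= start_key)
    return [(k, rows[k]) for k in keys[:count]]


def iterate_all(rows, page_size=3):
    """Yield every row once, paging with the last key read as the next start key."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start_key = ""          # the empty string starts at the beginning
    first_page = True
    while True:
        page = get_range_slices(rows, start_key, page_size)
        # The start key is inclusive, so later pages repeat the previous last row.
        fresh = page if first_page else page[1:]
        for key, columns in fresh:
            yield key, columns
        if len(page) < page_size:
            return          # a short page means the end was reached
        start_key = page[-1][0]
        first_page = False


rows = {key: {"value": i} for i, key in enumerate("abcdefg")}
print([key for key, _ in iterate_all(rows, page_size=3)])
```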
