
1. FM Algorithm
Suppose stream elements are chosen from some universal set. We would like to know how
many different elements have appeared in the stream, counting either from the beginning of the
stream or from some known time in the past.

Flajolet-Martin Algorithm is used for estimating the number of unique elements in a stream in a
single pass. The time complexity of this algorithm is O(n) and the space complexity is O(log m)
where n is the number of elements in the stream and m is the number of unique elements. Its
idea is that the more different elements we see in the stream, the more different hash-values
we shall see. As we see more different hash-values, it becomes more likely that one of these
values will be “unusual.” The particular unusual property we shall exploit is that the value ends
in many 0’s, although many other options exist.

The major components of this algorithm are :

o A collection of hash functions, and
o A bit-string of length L such that 2^L > n. A 64-bit string is sufficient for most cases.

Each incoming element will be hashed using all the hash functions. The higher the number of distinct elements in the stream, the higher the number of different hash values will be.

On applying a hash function h on an element of the stream e, the hash value h(e) is produced. We convert h(e) into an equivalent binary bit-string. This bit string will end in some number of zeroes. For instance, the 5-bit string 11010 ends in 1 zero, while 10001 ends in no zeroes.

This count of zeroes is known as the tail length. If R denotes the maximum tail length of any element e encountered thus far in the stream, then the estimate for the number of unique elements in the stream is 2^R.

Now to see that this estimate makes sense we have to use the following arguments using
probability theory:

- If m ≫ 2^r, the probability of finding a tail of length at least r approaches 1.

- If m ≪ 2^r, the probability of finding a tail of length at least r approaches 0.
So, we can conclude that the estimate of 2^R is neither going to be too low nor too high.
- Let us now understand the working of the algorithm with an example :
Stream: 5, 3, 9, 2, 7, 11
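A minimal Python sketch of this estimate for the stream above, assuming a simple illustrative hash function h(x) = (5x + 2) mod 32 (in practice several independent hash functions would be combined):

```python
def tail_length(value):
    """Count trailing zeroes in the binary representation of a hash value."""
    if value == 0:
        return 5            # treat an all-zero 5-bit string as a tail of full length
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def fm_estimate(stream, h):
    """Flajolet-Martin estimate: 2**R, where R is the longest tail length seen."""
    R = 0
    for element in stream:
        R = max(R, tail_length(h(element)))
    return 2 ** R

stream = [5, 3, 9, 2, 7, 11]
h = lambda x: (5 * x + 2) % 32          # hypothetical hash into 5-bit values
print(fm_estimate(stream, h))           # here R = 2 (from element 2), so the estimate is 4;
                                        # the true number of distinct elements is 6
```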
2. Hadoop and Hadoop Ecosystem with diagram and explain its components
Hadoop ecosystem is a platform or framework which helps in solving big data problems. It comprises different components and services (ingesting, storing, analyzing, and maintaining data). The main core components of Hadoop are HDFS, YARN and MapReduce, with tools such as Spark and Zookeeper forming part of the wider ecosystem.

Components of Hadoop
● HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data
● Yet Another Resource Negotiator(YARN), as the name implies, YARN is the one who
helps to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop System.
● By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps to write applications which transform big data sets into manageable ones.
○ Map() performs sorting and filtering of data and thereby organizing them in the
form of group.
○ Reduce(), as the name suggests does the summarization by aggregating the
mapped data.
● PIG: Pig was developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
● HIVE: With the help of an SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
● Mahout: Mahout provides machine learning capability to a system or application. It provides various libraries or functionalities such as collaborative filtering, clustering, and classification, which are nothing but concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
● Apache Spark: It's a platform that handles all the processing-intensive tasks like batch processing, interactive or iterative real-time processing, graph processing, visualization, etc.
● Apache HBase: It's a NoSQL database which supports all kinds of data and is thus capable of handling any kind of data inside a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
● Zookeeper: There was a huge issue of management of coordination and synchronization
among the resources or the components of Hadoop which resulted in inconsistency,
often. Zookeeper overcame all the problems by performing synchronization,
inter-component based communication, grouping, and maintenance.
● Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to them.

Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models.
● It can handle arbitrary text and binary data.
● So Hadoop can digest any unstructured data easily.
● We saw that in the traditional approach, having separate storage and processing clusters is not the best fit for big data. Hadoop clusters, however, provide storage and distributed computing all in one.
Limitations of Hadoop
● Issue with Small Files
● Slow Processing Speed
● Support for Batch Processing only
● No Real-time Data Processing
3. HDFS with diagram
● The Hadoop Distributed File System (HDFS) is designed to provide a fault-tolerant file
system designed to run on commodity hardware.
● The primary objective of HDFS is to store data reliably even in the presence of failures
including Name Node failures, Data Node failures and network partitions.
● HDFS uses a master/slave architecture in which one device (the master) controls one or
more other devices (the slaves).
● The HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files.

Name Node
● It is the master of HDFS (Hadoop file system).
● Keeps track of how each file is distributed (as blocks) across the different data nodes; in Hadoop 1.x the Job Tracker daemon typically runs alongside it on the master node.
● Failure of Name Node will lead to the failure of the full Hadoop system.

Data node
● Data node is the slave of HDFS.
● Data nodes can communicate with each other through the name node to avoid duplication of the provided task.
● Data nodes report changes to the name node.

Job Tracker
● Determines which files to process.
● There can be only one job tracker per Hadoop cluster.

Task Tracker
● Only a single task tracker is present per slave node.
● Performs tasks given by the job tracker and also continuously communicates with the job tracker.

SSN (Secondary Name Node)

● Its main purpose is to monitor the Name Node and take periodic checkpoints of its file system metadata.
● One SSN is present per cluster.
4. Map Reduce to multiply 2 matrix
Let us examine the algorithm on an example. Suppose we have two matrices: M, a 2×3 matrix with entries m_ij, and N, a 3×2 matrix with entries n_jk. The product P = MN is a 2×2 matrix with entries p_ik = Σ_j m_ij × n_jk.

Map:
For each element m_ij of matrix M, the map task produces the key-value pairs ((i, k), (M, j, m_ij)) for k = 1, 2. Similarly, for each element n_jk of matrix N, it produces the pairs ((i, k), (N, j, n_jk)) for i = 1, 2. The key (i, k) identifies the cell of the product matrix that the element contributes to.

Reduce:
The reduce task takes these key-value pairs as input and processes one key at a time. For each key (i, k) it divides the values into two separate lists, one for the values that begin with M and one for the values that begin with N, pairs them up on the index j, and then sums up the products of m_ij and n_jk over j. For key (1,1), with the first row of M being (1, 2, 3) and the first column of N being (a, c, e), this gives:
P(1,1) = 1a + 2c + 3e
The same computation is applied to every key handled by the reduce task, giving p_ik for all i and k. Together these entries form the product matrix P of MN; a runnable sketch of both phases is given below.
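A minimal in-memory Python sketch of the two phases (not a real Hadoop job; the numeric matrices below are illustrative stand-ins for M and N):

```python
from collections import defaultdict

# Illustrative 2x3 and 3x2 matrices (numeric stand-ins for a..f).
M = [[1, 2, 3],
     [4, 5, 6]]
N = [[7, 8],
     [9, 10],
     [11, 12]]
I, J, K = 2, 3, 2     # M is IxJ, N is JxK

# Map phase: every element is tagged with the output cells (i, k) it contributes to.
mapped = []
for i in range(I):
    for j in range(J):
        for k in range(K):
            mapped.append(((i, k), ('M', j, M[i][j])))
for j in range(J):
    for k in range(K):
        for i in range(I):
            mapped.append(((i, k), ('N', j, N[j][k])))

# Shuffle: group values by key (i, k).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: for each cell, pair M and N values on j and sum the products.
P = {}
for (i, k), values in groups.items():
    m_vals = {j: v for tag, j, v in values if tag == 'M'}
    n_vals = {j: v for tag, j, v in values if tag == 'N'}
    P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals)

print(P[(0, 0)])   # 1*7 + 2*9 + 3*11 = 58, i.e. the P(1,1) entry
```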

5. Algebraic Operations in Map Reduce - Union, Intersection, Joins

Relational-Algebra Operations
1. R, S - relations
2. t, t' - tuples
3. C - a condition of selection
4. A, B, C - subsets of attributes
5. a, b, c - attribute values for a given subset of attributes

Selection
● Map: For each tuple t in R, test if it satisfies C. If so, produce the key-value pair (t, t).
That is, both the key and value are t.
● Reduce: The Reduce function is the identity. It simply passes each key-value pair to the
output.
Projection
● Map: For each tuple t in R, construct a tuple t’ by eliminating from t those components
whose attributes are not in A. Output the key-value pair (t’, t’).
● Reduce: For each key t' produced by any of the Map tasks, there will be one or more key-value pairs (t', t'). The Reduce function turns them into a single output pair (t', t'), thereby eliminating duplicates.

Union
● Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t).
● Reduce: Associated with each key t there will be either one or two values. Produce
output (t, t) in either case.

Intersection
● Map: Turn each input tuple t either from relation R or S into a key-value pair (t, t).
● Reduce: If key t has value list [t, t], then produce (t,t). Otherwise, produce nothing.
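A small Python simulation of the Map and Reduce functions above for Union and Intersection, assuming the relations are plain in-memory lists of tuples rather than distributed files:

```python
from collections import defaultdict

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'b'), (3, 'c'), (4, 'd')]

def group(pairs):
    """Simulate the shuffle step: collect all values for each key."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

# Union: Map emits (t, t) from both relations; Reduce outputs t once per key.
union_map = [(t, t) for t in R] + [(t, t) for t in S]
union = list(group(union_map))          # every distinct tuple appears exactly once

# Intersection: same Map; Reduce outputs t only if the value list is [t, t].
inter_map = [(t, t) for t in R] + [(t, t) for t in S]
intersection = [t for t, values in group(inter_map).items() if len(values) == 2]

print(sorted(union))         # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
print(sorted(intersection))  # [(2, 'b'), (3, 'c')]
```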
6. Data Stream Management System with diagram and components
It is similar to a conventional relational database management system (RDBMS).

1. Input streams : The input streams have the following characteristics:


● There can be one or more input streams entering the system.
● The streams can have different data types
● The rate of data flow of each stream may be different.
● Within a stream the time interval between the arrival of data items may differ. For
example, suppose the second item arrives after 2 ms from the arrival of the first data
item, then it is not necessary that the third data item arrive after 2 ms from the arrival of
the second data item. It may arrive earlier or even later.
2. Stream processor: All types of processing, such as sampling, cleaning, filtering, and querying on the input stream data, are done here. Two types of queries are supported: standing queries and ad-hoc queries.
3. Working storage: A limited store such as main memory or a disk is used as the working storage for holding parts of streams and summaries so that queries can be executed. If faster processing is needed, main memory is used; otherwise a secondary storage disk is used. As the working storage is limited in size, it is not possible to store all the data received from all the streams.
4. Archival store: The archival store is a large storage area in which the streams may be archived, but execution of queries directly on the archival store is not supported. Also, fetching data from this store takes a lot of time compared to fetching data from the working store.
5. Output streams : The output consists of the fully processed streams and the results of the execution of queries on the streams.

The difference between a conventional database-management system and a data-stream management system is that in the case of the database-management system all of the data is available on the disk and the system can control the rate of data reads. On the other hand, in the case of the data-stream-management system the rate of arrival of data is not in the control of the system, and the system has to take care of the possibility of data getting lost and take the necessary precautionary measures.
7. Recommendation System with types and how classification helps in recommendation
system
● It is very widely used nowadays. It is essentially a subclass of information filtering system. It is used to give recommendations for books, games, news, movies, music, research articles, social tags etc.
● It is also useful for experts, financial services, life insurance, and social media like Twitter etc.
● Collaborative filtering and content-based filtering are the two approaches used by recommendation systems.
● Collaborative filtering uses the user's past behaviour, makes predictions about what the user may like, and recommends items accordingly.
● Content-based filtering uses the properties of items previously preferred by the user.
● By combining collaborative filtering and content-based filtering, a hybrid recommendation system is developed.

Content Based filtering


Recommend items to customer x similar to previous items rated highly by x
• A user is likely to have similar level of interest for similar items
• System learns importance of item features, and builds a model of what user likes

Item Profiles
• For each item, create an item profile
• Profile is a set (vector) of features
• Movies: author, title, actor, director,...
• Text: Set of “important” words in document
• How to pick important features?
• Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
• Term ... Feature
• Document ... Item
TF-IDF (Term Frequency * Inverse Doc Frequency)
fij = frequency of term (feature) i in doc (item) j

TFij = fij / maxk fkj (term frequency, normalized by the most frequent term in the document)

ni = number of docs that mention term i

N = total number of docs

IDFi = log2(N / ni)

TF-IDF score: wij = TFij × IDFi

Doc profile = set of words with highest TF-IDF scores, together with their scores
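A short Python sketch of these TF-IDF scores on a small made-up corpus (log base 2, matching the IDF formula above):

```python
import math
from collections import Counter

# Toy corpus: each "document" (item) is a list of words.
docs = [
    "big data hadoop hadoop spark".split(),
    "spark streaming data".split(),
    "movie review data data data".split(),
]
N = len(docs)

# n_i: number of documents that mention term i.
n = Counter()
for d in docs:
    n.update(set(d))

def tf_idf(doc):
    """w_ij = (f_ij / max_k f_kj) * log2(N / n_i) for every term i in one document j."""
    f = Counter(doc)
    max_f = max(f.values())
    return {term: (count / max_f) * math.log2(N / n[term]) for term, count in f.items()}

for d in docs:
    scores = tf_idf(d)
    # the item profile keeps the highest-scoring terms
    print(sorted(scores, key=scores.get, reverse=True)[:2])
```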

User Profiles and Prediction


• User profile possibilities:
• Weighted average of rated item profiles
• Variation: weight by difference from average rating for item
• Prediction heuristic:
• Given user profile x and item profile i, estimate the cosine similarity u(x, i) = cos(x, i) = (x · i) / (||x|| ||i||)

Obtaining features from content


- We can obtain features of items by tagging items, entering phrases, or searching a value range for price, description, tags, etc.
- By keeping a tag option available, users can search for an item by a feature or tag value, like a price range for books or a colour shade for clothes in online shopping.
- One problem with tagging is creating the tags and building enough tag awareness at the user level.

Representing in profile
Some features are numerical. For instance, we might take the rating for mobile applications based on the Android OS as a real number, like 1, 2, 3, 4 or 5 stars given by users. Keeping only a single option, like 1 star or 5 stars, would not make sense for computing an average rating. By keeping 5 options for the rating, a good average rating can be observed, and the structure implicit in the numbers is preserved. A numerical rating provides a single component of the vector representing the item, and other integer-valued or real-valued parameters likewise become numerical features.

User profile
Vectors are useful to describe items and users' preferences. The relation between users and items can be represented with the help of a utility matrix.
Example : Consider a case like before, but where the utility matrix has some nonblank entries that are ratings in the 1-5 range. Suppose user u gives ratings with an average of 3, and three applications (Android OS based games) got ratings of 3, 4 and 5 from u. Then in the user profile of u, the component for applications will have the value that is the average of 3-3, 4-3 and 5-3, i.e. a value of 1.
On the other hand, user v gives an average rating of 4, and v's responses to three applications are 3, 5 and 2. The user profile for v has, in the component for applications, the average of 3-4, 5-4 and 2-4, i.e. a value of -2/3.
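The arithmetic of this example can be checked with a couple of lines of Python (u's ratings 3, 4, 5 around an average of 3; v's ratings 3, 5, 2 around an average of 4):

```python
def profile_component(ratings, user_average):
    """Average of (rating - user's average rating) over the rated items of one kind."""
    return sum(r - user_average for r in ratings) / len(ratings)

print(profile_component([3, 4, 5], 3))   # (0 + 1 + 2) / 3 = 1
print(profile_component([3, 5, 2], 4))   # (-1 + 1 - 2) / 3 = -2/3, about -0.67
```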

Recommendation
The cosine distance between a user's vector and an item's vector can be computed with the help of the profile vectors for both users and items.
It is helpful to estimate the degree to which the user will prefer an item (i.e. it gives a prediction for recommendation).
If the user's profile vector and the item's vector (e.g. responses on a 1 to 5 scale for mobile apps) point in a similar direction, the cosine of the angle between them is a large positive fraction. This means the angle is close to 0, and hence there is a very small cosine distance between the vectors.
If they point in opposite directions, the cosine of the angle is a large negative fraction. This means the angle is close to 180 degrees, which is the maximum possible cosine distance.
The cosine similarity function, for measuring the cosine of the angle between two vectors x and i, is

cos(x, i) = (x · i) / (||x|| ||i||)

where, for a document (item), the components of the item vector are the TF-IDF weights of its terms 't':
TF: Term Frequency
IDF : Inverse Document Frequency

Classification Algorithm
It is used to learn the user's interest. By applying a learned function to a new item, we get the probability that the user may like it as a numeric value; this also helps to know the degree of interest in some particular item.
A few of the techniques are listed as follows :
(1) Decision Tree and Rule Induction
(2) Nearest Neighbour Method
(3) Relevance Feedback and Rocchio
(4) Linear Classification
(5) Probabilistic Methods
(6) Naive Bayes
Collaboration-based recommendation
Recommending well from among large sets of items is very important: the recommendations must be appreciated by the user, or else the effort taken for them is worthless.

The toolbox used for collaborative filtering has become advanced with the help of Bayesian inference, case-based reasoning methods and information retrieval.
Collaborative filtering deals with users and items. A preference given by a user to an item is known as a rating and is represented by the triplet (user, item, rating).
The predict task tells what preference a user is likely to give to an item.
The recommend task is helpful to design a list of n items for the user's needs. These n items are not chosen purely on the basis of predicted preference, because the criteria used to create recommendations may be different.

Measuring Similarity
How to measure the similarity of items or users from the values of the utility matrix is really a big question.

Jaccard Distance
Here only the sets of items rated are considered, while the rating values in the matrix are ignored. The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 minus this value.

Alternatively, the Jaccard distance can be given using the symmetric difference

A Δ B = (A ∪ B) - (A ∩ B)

as |A Δ B| / |A ∪ B|.

Cosine Distance
• If a user doesn't give any rating from 1 to 5 to an app, then it is considered as a 0 (zero) when computing the cosine distance (a numeric sketch of both measures follows).
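A brief Python sketch of both measures on two illustrative rows of a utility matrix (ratings on a 1-5 scale, with unrated items treated as 0 for the cosine computation, as noted above):

```python
import math

# Two users' rows of a utility matrix over five items; 0 means "not rated".
u = [4, 5, 0, 3, 0]
v = [0, 5, 4, 0, 1]

# Jaccard distance: only WHICH items were rated matters, not the rating values.
rated_u = {i for i, r in enumerate(u) if r}
rated_v = {i for i, r in enumerate(v) if r}
jaccard_similarity = len(rated_u & rated_v) / len(rated_u | rated_v)
print(1 - jaccard_similarity)            # 1 - 1/5 = 0.8

# Cosine distance: treat the rows as vectors, with missing ratings as 0.
norm = lambda x: math.sqrt(sum(a * a for a in x))
cosine_similarity = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
print(1 - cosine_similarity)             # roughly 0.45: a moderate distance
```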
8. Short note on Social Graph
A social graph is a diagram that illustrates interconnections among people, groups and
organizations in a social network. The term is also used to describe an individual's social
network. When portrayed as a map, a social graph appears as a set of network nodes that are
connected by lines
● In general a graph is a collection of a set of edges (E) and a set of vertices (V). If an edge exists between any two nodes of the graph, then those nodes are related to each other.
● Graphs are categorized by many parameters, like ordered pairs of nodes or unordered pairs of nodes.
● Some edges have a direction or a weight. The relationships in a graph can be explained with the help of an adjacency matrix.
● Small networks can easily be managed by constructing a graph, but this is quite impossible with a huge/wide network.
● Summary statistics and performance metrics are useful for the design of a graph for a large network.
● Networks and graphs can be described with the help of a few parameters like diameter (i.e. the largest distance between any two nodes), centrality and degree distribution.
● Social websites like Facebook use an undirected social graph for friends, while a directed graph is used in social websites like Twitter and Google+ (plus). LinkedIn shows connections as 1st, 2nd and 3rd degree, and Google+ classifies linked connections into Friends, Family, Acquaintances, Following etc.

Parameters used in Social Graph


Each node is a distinct entity in a network and becomes part of the graph through a set of links. Some general parameters of a social network viewed as a graph are :
(a) Degree : Number of adjacent nodes (considering both out-degree and in-degree). The degree of node ni is denoted by d(ni).
(b) Geodesic Distance : The shortest-path distance between two nodes ni and nj, expressed as d(i, j).
(c) Density : It gives the connectedness of a graph; it is useful to measure how close a network is to being complete.
(d) Centrality : It tells how central a node is in the network, e.g. degree centrality, i.e. a node's appearance at the centre of the network; centrality has several types.
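These parameters can be computed directly for a small graph; a sketch using the third-party networkx library on a made-up friendship graph:

```python
import networkx as nx

# A tiny undirected "friendship" graph (illustrative).
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(dict(G.degree()))                        # degree of each node
print(nx.density(G))                           # edges present / maximum possible edges
print(nx.shortest_path_length(G, "A", "E"))    # geodesic distance d(A, E) = 3 (A-B-D-E)
print(nx.degree_centrality(G))                 # one of the centrality measures mentioned above
print(nx.diameter(G))                          # largest geodesic distance in the graph
```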

9. When it comes to big data how NoSQL scores over RDBMS [OR] Explain different ways
by which big data problems are handled by NoSQL.
NoSQL databases offer many benefits over relational databases. NoSQL databases have
flexible data models, scale horizontally, have incredibly fast queries, and are easy for
developers to work with.

● Flexible data models: NoSQL databases typically have very flexible schemas. A flexible
schema allows you to easily make changes to your database as requirements change.
You can iterate quickly and continuously integrate new application features to provide
value to your users faster.

● Horizontal scaling: Most SQL databases require you to scale-up vertically (migrate to a
larger, more expensive server) when you exceed the capacity requirements of your
current server. Conversely, most NoSQL databases allow you to scale-out horizontally,
meaning you can add cheaper commodity servers whenever you need to.

● Fast queries: Queries in NoSQL databases can be faster than SQL databases. Why?
Data in SQL databases is typically normalized, so queries for a single object or entity
require you to join data from multiple tables. As your tables grow in size, the joins can
become expensive. However, data in NoSQL databases is typically stored in a way that
is optimized for queries. The rule of thumb when you use MongoDB is data that is
accessed together should be stored together. Queries typically do not require joins, so
the queries are very fast.

● Easy for developers: Some NoSQL databases like MongoDB map their data structures
to those of popular programming languages. This mapping allows developers to store
their data in the same way that they use it in their application code. While it may seem
like a trivial advantage, this mapping can allow developers to write less code, leading to
faster development time and fewer bugs.
10. Traditional vs Big Data

11. How recommendation is done based on properties of product? Explain with example
Refer question 7
12. What is the role of JobTracker and TaskTracker in MapReduce. Illustrate MapReduce with Word Count Example
Job Tracker :-
● Job tracker is a daemon that runs on a namenode for submitting and tracking
MapReduce jobs in Hadoop.
● It assigns the tasks to the different task tracker.
● In a Hadoop cluster, there will be only one job tracker but many task trackers.
● It is the single point of failure for Hadoop and MapReduce Service.
● If the job tracker goes down all the running jobs are halted.
● It receives heartbeat from task tracker based on which Job tracker decides whether the
assigned task is completed or not.

Task Tracker :-
● Task tracker is also a daemon that runs on datanodes.
● Task Trackers manage the execution of individual tasks on slave node.
● When a client submits a job, the job tracker will initialize the job and divide the work and
assign them to different task trackers to perform MapReduce tasks.
● While performing this action, the task tracker will be simultaneously communicating with
job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task
tracker within specified time, then it will assume that task tracker has crashed and assign
that task to another task tracker in the cluster.

WordCount Example
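A minimal Python sketch of the Map, shuffle and Reduce steps for word count, run in memory on three illustrative lines rather than on a real Hadoop cluster:

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(word, counts):
    """Reduce: sum all the 1's emitted for a given word."""
    return word, sum(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle/sort phase: group values by key
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase
for word in sorted(grouped):
    print(reducer(word, grouped[word]))   # ('bear', 2), ('car', 3), ('deer', 2), ('river', 2)
```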
13. How big data problems are handled by Hadoop system./ Advantages of Hadoop
System/ Features of Hadoop system
Advantages of Hadoop
● It can handle arbitrary text and binary data.
● So Hadoop can digest any unstructured data easily.
● We saw that in the traditional approach, having separate storage and processing clusters is not the best fit for big data.
● Hadoop clusters, however, provide storage and distributed computing all in one.

14. Explain how Hadoop goals are covered in hadoop distributed file system.
Features of HDFS:
● Highly Scalable - HDFS is highly scalable as it can scale hundreds of nodes in a single
cluster.
● Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
● Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant in that if any machine fails, the other machine containing the copy of that data automatically becomes active.
● Distributed data storage - This is one of the most important features of HDFS that makes
Hadoop very powerful. Here, data is divided into multiple blocks and stored into nodes.
● Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

Goals of HDFS:
● Handling hardware failure - HDFS contains multiple server machines, so machine failures are expected. If any machine fails, the HDFS goal is to detect the failure and recover from it quickly.
● Streaming data access - Applications that run on HDFS are not the general-purpose applications that typically run on general-purpose file systems; they require streaming access to their data sets.
● Coherence model - Applications that run on HDFS follow the write-once-read-many approach. So, a file once created need not be changed; however, it can be appended to and truncated.
15. Explain DGIM algorithm for counting ones in stream with example.
The DGIM Algorithm (Datar - Gionis - Indyk - Motwani): The simplest case of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50%.
The two basic components of this algorithm are :
o Timestamps: the sequence number of a bit in its order of arrival.
o Buckets.
Each bit arriving in the stream is assigned a timestamp. So, the first bit is assigned timestamp 1, the second bit is assigned timestamp 2, and each subsequent bit is assigned a timestamp one higher than the previous bit.

We divide the window into buckets, each of which is represented by:

1. The timestamp of its right (most recent) end.
2. The number of 1's in the bucket. This number must be a power of 2, and we refer to the number of 1's as the size of the bucket.

There are six rules that must be followed when representing a stream by buckets.
● The right end of a bucket is always a position with a 1.
● Every position with a 1 is in some bucket.
● No position is in more than one bucket.
● There are one or two buckets of any given size, up to some maximum size.
● All sizes must be a power of 2.
● Buckets cannot decrease in size as we move to the left (back in time).

Example of counting ones
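A compact Python sketch of the bucket bookkeeping and the resulting estimate, for an illustrative window of N = 10 bits:

```python
class DGIM:
    """Each bucket is (timestamp of its right end, size); sizes are powers of 2 and
    at most two buckets of any size are kept (newest bucket first in the list)."""

    def __init__(self, window):
        self.window = window
        self.buckets = []
        self.t = 0                      # current timestamp

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.t, 1))
        # Whenever three buckets share a size, merge the two OLDEST of that size
        # into one bucket of twice the size, keeping the newer timestamp of the two.
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                ts, size = self.buckets[i + 1]
                del self.buckets[i + 1:i + 3]
                self.buckets.insert(i + 1, (ts, 2 * size))
            else:
                i += 1

    def estimate(self):
        """Count every bucket fully except the oldest, which contributes half its size."""
        if not self.buckets:
            return 0
        sizes = [size for _, size in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] // 2

d = DGIM(window=10)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    d.add(bit)
print(d.buckets)      # [(10, 1), (8, 2), (6, 4)]
print(d.estimate())   # 1 + 2 + (4 // 2) = 5, while the true count of 1's in the window is 7
```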


16. Explain the different architectural patterns in NoSQL with their applications?
Key-Value Store Database: This model is one of the most basic models of NoSQL databases.
As the name suggests, the data is stored in form of Key-Value Pairs. The key is usually a
sequence of strings, integers or characters but can also be a more advanced data type. The
value is typically linked or co-related to the key. The key-value pair storage databases generally
store data as a hash table where each key is unique. The value can be of any type (JSON,
BLOB(Binary Large Object), strings, etc). This type of pattern is usually used in shopping
websites or e-commerce applications.
○ Advantages:
■ Can handle large amounts of data and heavy load.
■ Easy retrieval of data by keys.
○ Limitations:
■ Complex queries that involve multiple key-value pairs may hurt performance.
■ Data involving many-to-many relationships is difficult to model.
Examples: DynamoDB

Column Store Database: Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work on columns: they store large amounts of data in columns together. The format and titles of the columns can diverge from one row to another. Every column is treated separately, but each individual column may still contain multiple other columns, as in traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
● Data is readily available.
● Queries like SUM, AVERAGE and COUNT can be easily performed on columns.
Examples:
HBase, Bigtable by Google, Cassandra
Document Database: The document database fetches and accumulates data in the form of key-value pairs, but here the values are called Documents. A document can be a complex data structure: text, arrays, strings, JSON, XML or any such format. The use of nested documents is also very common. It is very effective, as most of the data created is usually in the form of JSON and is unstructured.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage retrieval and managing of documents is easy.
Limitations:
● Handling multiple documents is challenging
● Aggregation operations may not work accurately.
Examples: MongoDB

Graph Databases: Clearly, this architecture pattern deals with the storage and management of
data in graphs. Graphs are basically structures that depict connections between two or more
objects in some data. The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of
contact for the graph. This pattern is very commonly used in social networks where there are a
large number of entities and each entity has one or many characteristics which are connected
by edges. The relational database pattern has tables that are loosely connected, whereas
graphs are often very strong and rigid in nature.
Advantages:
● Fastest traversal because of connections.
● Spatial data can be easily handled.
Limitations: Wrong connections may lead to infinite loops.
Examples: Neo4J
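To make the four patterns concrete, here is one hypothetical shopping-site user sketched in each style, using plain Python literals as stand-ins for the actual databases:

```python
# Key-value store: an opaque value looked up by a unique key.
key_value = {"user:1001": '{"name": "Asha", "cart": ["book", "pen"]}'}

# Document store: the value is a queryable, possibly nested document.
document = {
    "_id": 1001,
    "name": "Asha",
    "orders": [{"item": "book", "qty": 2}, {"item": "pen", "qty": 1}],
}

# Column-family store: each row groups its data into column families,
# and different rows may have different columns.
columns = {
    "row:1001": {
        "profile":  {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2024-01-05"},
    }
}

# Graph store: nodes (entities) joined by edges (relationships) with their own identity.
nodes = {1001: {"label": "User", "name": "Asha"},
         2001: {"label": "Product", "name": "book"}}
edges = [(1001, "PURCHASED", 2001)]

print(document["orders"][0]["item"])     # nested fields are queried directly: 'book'
```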

17. Explain Girvan-Newman algorithm to mine Social Graphs.


Divisive method: starts with the full graph and breaks it up to find communities.
Too slow for many large networks (unless they are very sparse), and it tends to give relatively
poor results for dense networks.
The edge betweenness for edge e is:

c_B(e) = Σ_{s≠t} σst(e) / σst

where σst is the total number of shortest paths from node s to node t and σst(e) is the number of those paths that pass through e. The summation runs over n(n − 1) pairs for directed graphs and n(n − 1)/2 pairs for undirected graphs.

Steps
Step 1. Find the edge of highest betweenness - or multiple edges of highest betweenness, if
there is a tie - and remove these edges from the graph. This may cause the graph to separate
into multiple components. If so, this is the first level of regions in the partitioning of the graph.

Step 2. Recalculate all betweennesses, and again remove the edge or edges of highest
betweenness. This may break some of the existing components into smaller components; if so,
these are regions nested within the larger regions.

Step 3. Proceed in this way as long as edges remain in the graph, in each step recalculating all betweennesses and removing the edge or edges of highest betweenness.
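These steps are implemented in the third-party networkx library as girvan_newman; a short example on a small illustrative graph with an obvious bridge edge:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Two natural communities {1, 2, 3} and {4, 5, 6} joined by the bridge edge (3, 4).
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])

# Edge betweenness: the bridge (3, 4) has the highest value and is removed first.
print(nx.edge_betweenness_centrality(G))

communities = girvan_newman(G)                       # generator of successive splits
first_split = next(communities)
print(sorted(sorted(c) for c in first_split))        # [[1, 2, 3], [4, 5, 6]]
```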

18. Challenges of queries in data stream


Unbounded Memory Requirements: Since data streams are potentially unbounded in size,
the amount of storage required to compute an exact answer to a data stream query may also
grow without bound. While external memory algorithms for handling data sets larger than main
memory have been studied, such algorithms are not well suited to data stream applications
since they do not support continuous queries and are typically too slow for real-time response.
New data is constantly arriving even as the old data is being processed; the amount of
computation time per data element must be low, or else the latency of the computation will be
too high and the algorithm will not be able to keep pace with the data stream.
Approximate Query Answering: When we are limited to a bounded amount of memory it is not always possible to produce exact answers for data stream queries; however, high-quality approximate answers are often acceptable in lieu of exact answers.

Sliding Window: One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. For example, only data from the last week could be considered in producing query answers, with data older than one week being discarded.

Blocking Operators: A blocking query operator is a query operator that is unable to produce
the first tuple of its output until it has seen its entire input. If one thinks about evaluating
continuous stream queries using a traditional tree of query operators, where data streams enter
at the leaves and final query answers are produced at the root, then the incorporation of
blocking operators into the query tree poses problems. Since continuous data streams may be
infinite, a blocking operator that has a data stream as one of its inputs will never see its entire
input, and therefore it will never be able to produce any output. Doing away with blocking
operators altogether would be problematic, but dealing with them effectively is one of the more
challenging aspects of data stream computation.

Queries Referencing Past Data: In the data stream model of computation, once a data
element has been streamed by, it cannot be revisited. This limitation means that ad hoc queries
that are issued after some data has already been discarded may be impossible to answer
accurately. One simple solution to this problem is to stipulate that ad hoc queries are only
allowed to reference future data: they are evaluated as though the data streams began at the
point when the query was issued, and any past stream elements are ignored (for the purposes
of that query). While this solution may not appear very satisfying, it may turn out to be perfectly
acceptable for many applications.

19. BASE Properties of NoSQL


NoSQL, referred to as "not only SQL" or "non-SQL", is an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases.
BASE stands for:
Basically Available – Rather than enforcing immediate consistency, BASE-modelled NoSQL
databases will ensure availability of data by spreading and replicating it across the nodes of the
database cluster.
Soft State – Due to the lack of immediate consistency, data values may change over time. The
BASE model breaks off with the concept of a database which enforces its own consistency,
delegating that responsibility to developers.
Eventually Consistent – The fact that BASE does not enforce immediate consistency does not
mean that it never achieves it. However, until it does, data reads are still possible (even though
they might not reflect the reality).
20. Girvan Newman for following graph

Solution
21. 5 V’s of Big Data
1. Volume:
● The name ‘Big Data’ itself is related to a size which is enormous.
● Volume is a huge amount of data.
● To determine the value of data, size of data plays a very crucial role. If the volume of
data is very large then it is actually considered as a ‘Big Data’.

2. Velocity:
● Velocity refers to the high speed of accumulation of data.
● In Big Data velocity data flows in from sources like machines, networks, social media,
mobile phones etc.

3. Variety:
● It refers to nature of data that is structured, semi-structured and unstructured data.
● It also refers to heterogeneous sources.

4. Veracity:
● It refers to inconsistencies and uncertainty in data, that is, data which is available can sometimes get messy, and quality and accuracy are difficult to control.

5. Value:
● After taking the 4 V's into account, there comes one more V which stands for Value.
● The bulk of data having no value is of no good to the company, unless you turn it into something useful.
22. Bloom Filter

● An array of n bits, initialized to 0's.
● A set H of k hash functions h1, h2, ... , hk, each of which maps a key to one of the n bit locations.
● A set S consisting of m keys.
1. Bloom filter accepts all those tuples whose keys are members of the set S and rejects all
other tuples.
2. Initially all the bit locations in the array are set to 0. Each key K is taken from S and, one by one, all of the k hash functions in H are applied to this key K. All the bit locations h1(K), ... , hk(K) are set to 1.
3. On arrival of a test tuple with key K, all the k different hash functions from H are once again applied to the test key K. If all the bit locations thus produced are 1, the tuple with the key K is accepted. If even a single bit location out of these turns out to be 0, then the tuple is rejected.
4. A Bloom filter is like a hash table, and simply uses one bit to keep track of whether an item hashed to that location.
5. If k = 1, it is equivalent to a hashing-based fingerprint system.
6. If n = cm for a small constant c, such as c = 8, then with k = 5 or 6 the false positive probability is just over 2%.
7. It is interesting that when k is optimal, k = ln(2) × (n/m), the probability that any given bit is 1 is p = 1/2, so an optimized Bloom filter looks like a random bit-string.

Algo:
Example
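A minimal Python sketch of the algorithm above, with two simple illustrative hash functions and a tiny key set:

```python
class BloomFilter:
    def __init__(self, n, hash_functions):
        self.n = n
        self.bits = [0] * n            # n-bit array, initialized to 0's
        self.hashes = hash_functions   # the set H of k hash functions

    def add(self, key):
        """Insert a member of S: set every bit location h_i(key) to 1."""
        for h in self.hashes:
            self.bits[h(key) % self.n] = 1

    def might_contain(self, key):
        """Accept only if ALL k bit locations are 1 (false positives are possible,
        false negatives are not)."""
        return all(self.bits[h(key) % self.n] for h in self.hashes)

# Two illustrative hash functions over integer keys, mapped into n = 32 bits.
H = [lambda x: 3 * x + 1, lambda x: 7 * x + 4]
bf = BloomFilter(n=32, hash_functions=H)

for key in [25, 159, 585]:            # the set S of keys to accept
    bf.add(key)

print(bf.might_contain(159))          # True: a genuine member of S
print(bf.might_contain(7))            # False: one of its bit locations is still 0
print(bf.might_contain(118))          # True: a false positive, its bits 3 and 30 happen to be set
```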

23. Types of queries over data streams

One-time queries (a class that includes traditional DBMS queries) are queries that are
evaluated once over a point-in-time snapshot of the data set, with the answer returned to the
user.
Continuous queries, on the other hand, are evaluated continuously as data streams continue
to arrive. Continuous queries are the more interesting class of data stream queries, and it is to
them that we will devote most of our attention. The answer to a continuous query is produced
over time, always reflecting the stream data seen so far. Continuous query answers may be
stored and updated as new data arrives, or they may be produced as data streams themselves.
Sometimes one or the other mode is preferred. For example, aggregation queries may involve
frequent changes to answer tuples, dictating the stored approach, while join queries are
monotonic and may produce rapid, unbounded answers, dictating the stream approach.
A predefined query is one that is supplied to the data stream management system before any
relevant data has arrived. Predefined queries are generally continuous queries, although
scheduled one-time queries can also be predefined.
Ad hoc queries, are issued online after the data streams have already begun. Ad hoc queries
can be either one-time queries or continuous queries. Ad hoc queries complicate the design of a
data stream management system, both because they are not known in advance for the
purposes of query optimization, identification of common subexpressions across queries, etc.,
and more importantly because the correct answer to an ad hoc query may require referencing
data elements that have already arrived on the data streams (and potentially have already been
discarded).
24. The snapshot of 10 transactions is given below for online shopping that generates
big data. Threshold value = 4 and Hash function= (i*j) mod 10 Find the frequent item sets
purchased for such big data by using suitable algorithm. Analyse the memory
requirements for it.

25. Sampling Data techniques in a Stream


26. Features of R Programming
R is a programming language and environment used for statistical computing and graphics.
The following software is required to learn and implement statistics in R:
1. R software
2. RStudio IDE

R is an open-source (free!) software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS, making it an ideal platform for all your statistical computing needs. R is frequently used by data scientists, statisticians, and researchers.

There are many things R can do for data scientists and analysts. These key features are what set
R apart from the crowd of statistical languages:
1. Open-source: R is an open-source software environment. It is free of cost and can be
adjusted and adapted according to the user’s and the project’s requirements. You can
make improvements and add packages for additional functionalities. R is freely available.
You can learn how to install R, Download and start practicing it.
2. Strong Graphical Capabilities: R can produce static graphics with production quality
visualizations and has extended libraries providing interactive graphic capabilities. This
makes data visualization and data representation very easy. From concise charts to elaborate and interactive flow diagrams, all are well within R's repertoire.
3. Highly Active Community: R has an open-source library which is supported by its
growing number of users. The R environment is continuously growing. This growth is
due to its large user-base.
4. A Wide Selection of Packages: CRAN, or the Comprehensive R Archive Network, houses packages for high-quality interactive graphics, web application development, quantitative analysis, machine learning procedures and more; there is a package available for every scenario. R contains a sea of packages for all forms of disciplines like astronomy, biology, etc. While R was originally used for academic purposes, it is now being used in industries as well.
5. Comprehensive Environment: R has a very comprehensive development environment
meaning it helps in statistical computing as well as software development. R is an
object-oriented programming language. It also has a robust package called shiny which
can be used to produce full-fledged web apps. Combined with data analysis and data
visualization, R can be used for highly interactive online data-driven storytelling.
6. Can Perform Complex Statistical Calculations: R can be used to perform simple and
complex mathematical and statistical calculations on data objects of a wide variety. It can
also perform such operations on large data sets.
7. Distributed Computing: In distributed computing, tasks are split between multiple
processing nodes to reduce processing time and increase efficiency. R has packages like
ddR and multiDplyr that enable it to use distributed computing to process large data sets.
8. Running Code Without a Compiler: R is an interpreted language which means that it does
not need a compiler to make a program from the code. R directly interprets provided code
into lower-level calls and pre-compiled code.
9. Interfacing with Databases: R contains several packages that enable it to interact with databases, such as ROracle, RMySQL and packages for the Open Database Connectivity (ODBC) protocol.
10. Data Variety: R can handle a variety of structured and unstructured data. It also provides
various data modeling and data operation facilities due to its interaction with databases.
11. Machine Learning: R can be used for machine learning as well. The best use of R when it
comes to machine learning is in case of exploration or when building one-off models.
12. Data Wrangling: Data wrangling is the process of cleaning complex and inconsistent data sets to enable convenient computation and further analysis. This is a very time-consuming process. R, with its extensive library of tools, can be used for database manipulation and wrangling.
13. Cross-platform Support: R is machine-independent. It supports cross-platform operation.
Therefore, it can be used on many different operating systems.
14. Compatible with Other Programming Languages: While most of its functions are written in R itself, C, C++ or FORTRAN can be used for computationally heavy tasks. Java, .NET, Python, C, C++, and FORTRAN can also be used to manipulate objects directly.
15. Data Handling and Storage: R is integrated with all the formats of data storage due to
which data handling becomes easy.
16. Vector Arithmetic: Vectors are the most basic data structure in R, and most other data
structures are derived from vectors. R uses vectors and vector arithmetic and does not
need a lot of looping to process a long set of values. This makes R much more efficient.
17. Compatibility with Other Data Processing Technologies: R can be easily paired with
other data processing and distributed computing technologies like Hadoop and Spark. It
is possible to remotely use a Spark cluster to process large datasets using R. R and
Hadoop can be paired as well to combine Hadoop’s large scale data processing and
distributing computing capabilities with R’s statistical computing power.
18. Generates Report in any Desired Format: R’s markdown package is the only report
generation package you will ever need when working with R. The markdown package
can help produce web pages. It can also generate reports in the form of word documents
or PowerPoint presentations. All with your R code and results embedded into them

Some Unique Features of R Programming


Due to a large number of packages available, there are many other handy features as well:
● Since R can perform operations directly on vectors, it doesn’t require too much looping.
● R can pull data from APIs, servers, SPSS files, and many other formats.
● R is useful for web scraping.
● It can perform multiple complex mathematical operations with a single command.
● Using R Markdown, it can create attractive reports that combine plaintext with code and
visualizations of the results.
