220CT Revision Notes
Short Notes
By: Salman Fazal
Contents
Normalisation
Big Data
Map/Reduce
Hadoop
NoSQL
Graph DB
Mongo DB
Cassandra DB
Data Mining
Extras
- Big Data & Hadoop
- Clusters & Consistency
Normalisation
The process by which we efficiently organize data in a database. It has two goals: eliminating redundant data and ensuring data dependencies make sense (only storing related data in a table).
*In order to achieve one level of normal form, each previous level must be met.
TASK: convert this table into third normal form.

Item      Colors        Price   Tax
T-shirt   Red, Blue     12      0.60
Polo      Red, Yellow   12      0.60
T-shirt   Red, Blue     12      0.60
Shirt     Blue, Black   25      1.25

First, split the multi-valued Colors column into one row per colour; price and tax depend on the item but not on the colour, so they move to a different table:

Item      Colors          Item      Price   Tax
T-shirt   Red             T-shirt   12      0.60
T-shirt   Blue            Polo      12      0.60
Polo      Red             Shirt     25      1.25
Polo      Yellow
Shirt     Blue
Shirt     Black
In the tables above, tax is dependent on the price and not on the item (a transitive dependency), so a new table is created:

Item      Price          Price   Tax
T-shirt   12             12      0.60
Polo      12             25      1.25
Shirt     25

The tables are now in third normal form: every non-key column depends only on its table's key.
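A minimal sketch of the resulting 3NF schema using Python's built-in sqlite3 module (the table and column names here are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()
    # One table per dependency: item -> colours, item -> price, price -> tax.
    cur.execute("CREATE TABLE item_colour (item TEXT, colour TEXT)")
    cur.execute("CREATE TABLE item_price (item TEXT PRIMARY KEY, price REAL)")
    cur.execute("CREATE TABLE price_tax (price REAL PRIMARY KEY, tax REAL)")
    cur.executemany("INSERT INTO item_price VALUES (?, ?)",
                    [("T-shirt", 12), ("Polo", 12), ("Shirt", 25)])
    cur.executemany("INSERT INTO price_tax VALUES (?, ?)",
                    [(12, 0.60), (25, 1.25)])
    # Tax is stored once per price; a join recovers the original view
    # without duplicating any tax values.
    print(cur.execute("""SELECT i.item, i.price, t.tax
                         FROM item_price i
                         JOIN price_tax t ON i.price = t.price""").fetchall())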
Extra
In the example table, the last column is Total Weight, which is calculated from the other columns in the row. When normalising, we need to eliminate the Total Weight column: although the total weight depends on the weight and quantity, the column is computed and can easily be reconstructed outside of the database. Therefore, the column does not belong in the database and must be discarded.
Big Data
Definition:
1. Big data refers to data sets that grow so large that it is difficult to capture, manage, store
and analyse with typical database software tools.
2. Huge volume of data that cannot be stored and processed using the traditional approach
within the given time-frame.
Types of Data:
- Structured - organised in a fixed schema (e.g. relational tables)
- Semi-structured - partially organised (e.g. XML, JSON)
- Unstructured - no predefined format (e.g. text, images, video)
Map/Reduce
Simply a way to take a big task and divide it into discrete tasks that can be done in parallel.
Cost-effective and easy to use.
Functionalities:
1. Split data into smaller chunks
2. Map data into key-value pairs according to the mapping key
3. Reduce and merge all related data
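To make the three functionalities concrete, here is a minimal word-count sketch in plain Python (the function names are illustrative, not from any framework):

    from collections import defaultdict

    def map_phase(chunk):
        # Emit a (word, 1) pair for every word in this chunk.
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(pairs):
        # Merge all pairs that share the same key by summing their counts.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    chunks = ["the quick brown fox", "the lazy dog", "the fox"]   # 1. split
    mapped = [p for chunk in chunks for p in map_phase(chunk)]    # 2. map
    print(reduce_phase(mapped))   # 3. reduce/merge -> {'the': 3, 'fox': 2, ...}

In a real system, each map_phase call would run on a different node in parallel.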
Pros             Cons
Simplicity       Restricted
Fault-tolerant   Does not provide a solution for graphs
Scalability
Hadoop
Big Data (recap) High-volume, high-velocity and high-variety data that demand cost-effective
information processing for enhanced insight and decision-making.
Hadoop Framework for parallel processing of large datasets distributed across clusters of nodes
(computers). An open-source software implementation of MapReduce.
*Hadoop QoS -> Scalable, Tolerant, Flexible & Efficient (see the Big Data section)
- Hadoop consists of two components: HDFS (storing data) and MapReduce (processing data)
Example: counting the number of times each word is used in every book in Coventry University Library. We would do the following:
1. Partition the texts (pages) and put each on a separate computer or computing element/instance (think cloud).
2. Each computing element takes care of its portion.
3. The word counts are then combined.
1. The MapReduce library splits files into pieces (64-256MB); the master assigns the tasks.
o Blocks are distributed across nodes
o Each input split is processed by one mapper (locally)
o Splitting depends on the file format
2. Mapping tasks
o Read contents from the input, then parse them into key-value pairs
o Apply the map operation to each pair
o The output file location is forwarded to the master, which then forwards the file locations to the reduce workers
3. Reduce
o Fetch the input locations sent by the master
o Sort the input by key
o For each key, apply the reduce operation to the values associated with that key
o Write the result to an output file, then return the file location to the master
Summary: during the map process, the master node instructs worker nodes to process their local input data. Hadoop then performs a shuffle, where each worker node passes its results to the appropriate reducer node. The master node collects the results from all reducers and compiles the answer to the overall query.
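As a hedged sketch of how this looks in practice: with Hadoop Streaming, any pair of programs that read stdin and write tab-separated key-value lines to stdout can act as the mapper and reducer. A minimal word-count pair might look like this (the file names are illustrative):

    # mapper.py - emits one "word<TAB>1" line per word
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py - input arrives sorted by key, so equal words are adjacent
    import sys
    from itertools import groupby

    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(n) for _, n in group)}")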
HDFS basics
- Files are split into fixed-size blocks and stored on nodes.
- Data blocks are replicated for fault-tolerance (default is 3).
- The client talks to the namenode for metadata (information about the filesystem, i.e. which datanodes manage which blocks), and talks to the datanodes for reads and writes.
Replication
- 3 copies (default) are created (objectives: load-balancing, fast access & fault tolerance). The first is written to the same node, the second to a different node within the same rack, and the third to a node in another rack.
NoSQL
Not Only SQL.
NoSQL databases are geared toward managing large sets of data that come in huge variety and at high velocity, often in distributed systems or the cloud.
CAP Theorem
A distributed system can guarantee at most two of the following three properties at once: Consistency (every read reflects the most recent write), Availability (every request gets a response) and Partition tolerance (the system keeps operating despite network failures between nodes).
NoSQL Family
- Key-value stores
- Document stores (e.g. MongoDB)
- Column-family stores (e.g. Cassandra)
- Graph databases
RDBMS VS NoSQL

RDBMS                                       NoSQL
Can store only structured data              Works with all kinds of data
Structured Query Language (SQL)             No predefined schema
Performance decreases with large volumes    Can support huge volumes of data
of data (joins required)                    without affecting its performance
Expensive hardware required for scaling     Horizontally scalable; uses cheap
                                            commodity hardware
Offers powerful queries such as joins       Has no functionality for joins as
and group by                                data is denormalised
ACID: Atomic, Consistent, Isolated,         CAP: Consistent, Available &
Durable                                     Partition-Tolerant
Graph DB
A database that uses graph structures with nodes, edges (relationships) & properties to
store and represent information.
- A graph is a collection of nodes (things) and edges (relationships). Both of these have
properties (in key-value pairs).
(RDBMS rows correspond to graph nodes.)
Nodes - instances of objects (entities), e.g. Billy is an instance of a user, Toyota of a car.
Relationships - connections between nodes. Each must have a name and a direction; this adds structure to the graph.
Features:
1. Flexible - can easily adapt to changes/additions, i.e. relationships and properties can be expanded and nodes tailored without affecting existing queries.
2. Speed - as the volume increases, traversal time stays constant, unlike an RDBMS where speed depends on the total amount of data stored (as several joins may be required).
3. Agility - can effectively and rapidly respond to changes.
4. Schemaless - unstructured (not a tabular-type format).
Traversal
Navigating a graph (from a specific node to other nodes) along relationship edges. Traversal is bidirectional: it can follow incoming or outgoing edges.
E.g. find my friends of friends => start with my node, navigate to a friend, find that friend's friends (see the sketch after this list).
- Depth-first: follow the first path to its end, then return and take the second path, and so on.
- Breadth-first: follow all the first steps/depths, then move to the second depth, and so on.
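A small Python sketch of the two traversal orders over a toy friends graph (the adjacency list and names are made up for illustration):

    from collections import deque

    graph = {
        "me": ["alice", "bob"],
        "alice": ["carol"],
        "bob": ["dave"],
        "carol": [],
        "dave": [],
    }

    def depth_first(start):
        # Follow the first path to its end before backtracking (uses a stack).
        stack, seen, order = [start], set(), []
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                order.append(node)
                stack.extend(reversed(graph[node]))
        return order

    def breadth_first(start):
        # Visit everything at depth 1, then depth 2, and so on (uses a queue).
        queue, seen, order = deque([start]), {start}, []
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order

    print(depth_first("me"))    # ['me', 'alice', 'carol', 'bob', 'dave']
    print(breadth_first("me"))  # ['me', 'alice', 'bob', 'carol', 'dave']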
Cypher - a query language for graph databases. A declarative language (you specify what you want rather than how to achieve it).
Commands are built from clauses that match patterns of nodes and relationships.
MongoDB
An open-source, non-relational, document-family database that provides high performance, high availability and horizontal scalability.
MongoDB Architecture
MongoDB can host a number of databases.
RDBMS      MongoDB
Database   Database
Table      Collection
Row        Document
Column     Field
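A minimal sketch of the same mapping using the pymongo driver, assuming a MongoDB server on localhost (the database, collection and field names are invented for the example):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["library"]               # database   (RDBMS: database)
    users = db["users"]                  # collection (RDBMS: table)
    users.insert_one({"name": "Billy", "car": "Toyota"})   # document (RDBMS: row)
    print(users.find_one({"name": "Billy"}))               # fields   (RDBMS: columns)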
MongoDB Features
- JSON-like (BSON) document data model
- Ad-hoc queries and secondary indexes
- Replication via replica sets (high availability)
- Sharding (horizontal scalability)
- Aggregation framework
Sharding
Partitioning data across multiple machines, so that each machine (shard) holds only a subset of the records. I.e. to insert data, the application only needs to access the machine/shard responsible for that record (a toy routing sketch follows the benefits list below).
Benefits
- Splits the workload - work is distributed amongst machines. This increases performance, as each machine has a smaller working set.
- Scaling - vertical scaling is too costly; sharding lets you add more machines to your cluster. This makes it possible to increase capacity without any downtime.
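A toy sketch of the routing idea, assuming a hypothetical hash-based sharding scheme with four shards (real MongoDB sharding routes via a shard key and the mongos router; this just illustrates the principle):

    def shard_for(key: str, n_shards: int = 4) -> int:
        # Hash the record key deterministically so every client routes the
        # same key to the same shard (Python's built-in hash() is salted
        # per process, so we sum the bytes instead).
        return sum(key.encode()) % n_shards

    # All reads and writes for this record go to a single shard.
    print(shard_for("user:42"))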
Replication
The process of duplicating data across multiple nodes. Provides redundancy and increases data availability.
Why replication?
- No single point of failure: data stays available if a node goes down
- Reads can be served by several nodes (load-balancing, fast access)
- Protects against data loss
Cassandra DB
A distributed, highly-scalable, fault-tolerant columnar database.
Column-family Database
- Rows in a column family need not share the same set of columns; in the example below, the second row simply has no Orbital Period value.

ID   Host Name   Discovery Method   Orbital Period   Timestamp
1    11 Com      Radial Velocity    326.03 ± 0.32    2016-02-12 11:32:00
2    2MASS       Imaging                             2016-11-12 18:05:09
Cassandra Architecture:
A masterless, peer-to-peer ring of nodes; data is partitioned across the nodes by hashing the row key, and any node can accept any request.
1. Flexible-schema - with Cassandra, it isn't necessary to decide what fields your records will need beforehand; you can add/remove fields on the fly. For massive databases, this is an incredible efficiency boost.
2. Scalability - you can add more hardware (nodes) as the amount of data increases. This also increases performance, as more nodes can do more work within the same time.
3. Fault-tolerant - in NoSQL databases (specifically Cassandra), data is replicated to multiple nodes, so a node failure will not cause any downtime or computational failure.
   Replication - 3 copies of the same data are created on different nodes (other objectives for replication are load-balancing and fast access). If a node fails, its data is replicated again to another node.
4. Flexible data storage - Cassandra can store all data types: structured, semi-structured or unstructured.
5. Fast reads and writes - with linear scalability, Cassandra can perform extremely fast writes without affecting its read efficiency.
6. Query language - CQL, an SQL-like language that makes moving from a relational database very easy (a short sketch follows).
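A hedged sketch of CQL in use, via the DataStax cassandra-driver package; the keyspace and table names are invented, and a Cassandra node is assumed on localhost:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.planets (
            id int PRIMARY KEY,
            host_name text,
            discovery_method text,
            orbital_period double
        )
    """)
    # Rows need not share the same columns: orbital_period is simply
    # omitted for the second row.
    session.execute("INSERT INTO demo.planets (id, host_name, discovery_method, orbital_period) "
                    "VALUES (1, '11 Com', 'Radial Velocity', 326.03)")
    session.execute("INSERT INTO demo.planets (id, host_name, discovery_method) "
                    "VALUES (2, '2MASS', 'Imaging')")
    for row in session.execute("SELECT * FROM demo.planets"):
        print(row)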
Extra (how Cassandra retrieves data; the example is part of the NASA Exoplanet dataset):
This method is much more effective and performs much better when running large numbers of queries!
Data Mining
- Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information.
- In simple terms: data mining refers to extracting knowledge from large amounts of data.
- The information can be used for any application purpose, such as to increase revenue, cut costs, make forecasts, etc.
Data Warehousing is a process of combining data from multiple sources into one common
repository (dataset). Data Mining is a process of finding patterns in a given dataset.
Steps:
1. Gather and select the data.
2. Pre-process the data (cleaning, integration, transformation).
3. Build and train the model on the prepared data.
4. Apply the model, combining the feedback and findings on new incoming examples.
Classification [predictive] - categorising; the process in which ideas and objects are recognised, differentiated and understood.
Clustering [descriptive] - grouping the data into more than one group based on similarity.
o For example, news can be clustered into different groups: entertainment, politics, national and world news.
Association [descriptive] - identifies relationships between events that occur at one time.
Sequencing [descriptive] - identifies relationships that exist over a period of time.
Forecasting - the process of making predictions of the future based on past and present data and analysis of trends.
Regression [predictive] - a statistical process for estimating the relationships among variables.
Time Series - analysis that examines a value as it varies over time.
Applications:
- fraud detection,
- aiding marketing campaigns,
- detecting diseases,
- scientific experiments,
- weather prediction,
- studying consumers.
A decision tree can be used as a model for sequential decision problems under uncertainty.
Pros
- easy to interpret
- easy to construct
- can handle a large number of features
- very fast at testing time
Cons
- prone to overfitting the training data
- small changes in the data can produce a very different tree
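A brief sketch of the pros in practice, using scikit-learn's DecisionTreeClassifier on a bundled dataset (the library choice is illustrative; the notes do not name a tool):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)   # easy to construct
    print(clf.predict(X[:5]))                             # very fast at testing time
    print(clf.score(X, y))                                # training accuracy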
Self-Organising Map (SOM): train the map using examples from the data set. Used for clustering data without knowing the classes of the input data.

Pros                             Cons
No need to specify classes       Difficult to understand its decisions
Can visualise data               Training produces a different map each time
Can identify new relationships
SAS Enterprise Miner - streamlines the data mining process to create highly accurate predictive and descriptive models.
Benefits:
Data pre-processing
- High-quality data mining needs data that is useful; to achieve this we perform some preprocessing on the data. This combines data cleaning, data integration and data transformation.
- Data quality issues can be expensive and time-consuming to overcome.
- Benefits: cost savings, increased efficiency, reduction of risk/fraud, and more informed decisions.
Data cleaning - fill in missing values, smooth noisy data, correct incorrect values (a small sketch follows the next definition).
Data reduction - techniques applied to obtain a reduced representation of the data that is much smaller in volume, yet very similar to the original data.
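A small pandas sketch of the cleaning steps (the toy column names and values are invented):

    import pandas as pd

    df = pd.DataFrame({"age": [25, None, 40, 40],
                       "city": ["Leeds", "York", "York", "York"]})
    df["age"] = df["age"].fillna(df["age"].mean())   # fill in missing values
    df = df.drop_duplicates()                        # remove duplicate rows
    print(df)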
EXTRAS
HADOOP AND BIG DATA
HDFS
HDFS architecture
- files stored in blocks (64-256MB)
- provides reliability through replication
Failures:
- DataNode - marked as failed if no report/heartbeat is sent to the NameNode. The NameNode then replicates the lost blocks to other nodes.
- NameNode - a new or backup master takes over. The NameNode keeps checkpoints, so the new master starts from the previous checkpoint.
Replication:
3 copies are created:
- the first on the same node
- the second on a different node within the same rack
- the third on a node in another rack
MAPREDUCE
-2 stages:
1. Map stage - splits data into smaller chunks and maps them into key/value pairs
2. Reduce stage - sorts/shuffles by key, then outputs the combined results
Steps:
Input data, Split, Map, Shuffle, Reduce, Output results.
CLUSTER DATABASES
The traditional model runs on one big machine; there is a single point of failure if the machine, storage or network goes down. It is also difficult to scale up, as you would need to buy a whole new machine (server), which is too costly and not flexible.
To resolve this, we use a cluster. A cluster combines several racks, each of which contains several machines/nodes. Flexibility is achieved because data is replicated: we won't need separate backups, as the data is always available. There is also no single point of failure, as each piece of data is replicated on at least 2 nodes. If scaling out is required, just add more nodes to the cluster. Cheaper and more flexible.
Types of replication
Synchronous - all replicas are updated on every write; all nodes are always up to date.
Asynchronous - writes the data as soon as possible, but reads could be out of date (eventual consistency).
Consistency
In relational databases, ACID consistency maintains data integrity. In NoSQL, consistency refers to whether or not reads reflect previous writes.
Inconsistencies occur if two versions of the database are updated at the same time, or if a read is made from one machine while it has still not been updated.