
Data Science and Analytics

Introduction

Data Science and Analytics (SW326)– 6th Term – 19SW Dr. Isma Farah Siddiqui isma.farah@faculty.muet.edu.pk
isma.farah@teacher.muet.edu.pk
Course Overview
• Course Information
• B.E Course major in Software Engineering

• Course Objectives
CLO | Description | Taxonomy level | PLO
1 | Determine fundamental information to gain insight into the challenges of big data. | C3 | 1
2 | Evaluate techniques for storing and processing large amounts of structured and unstructured data. | C6 | 2, 4
3 | Plan the application of big data to obtain valuable information on market trends. | C5 | 3
4 | Implement and deploy a sample project for extracting useful information from a mid-sized dataset. | P3 | 5
[Figure: taxonomy levels grouped as lower order, intermediate, and higher order]

Program Learning Outcomes (PLOs):
The twelve graduate attributes provided by the PEC as per Manual of Accreditation 2014 have been adopted by the Department of
Software Engineering MUET, Jamshoro as the Program Learning Outcomes (PLOs) for its Bachelor’s in Software Engineering Program.

Sr.# | PLO | Description
1. | Engineering Knowledge | An ability to apply knowledge of mathematics, science, engineering fundamentals and an engineering specialization to the solution of complex engineering problems.
2. | Problem Analysis | An ability to identify, formulate, research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences and engineering sciences.
3. | Design / Development of Solutions | An ability to design solutions for complex engineering problems and design systems, components or processes that meet specified needs with appropriate consideration for public health and safety, cultural, societal, and environmental considerations.
4. | Investigation | An ability to investigate complex engineering problems in a methodical way including literature survey, design and conduct of experiments, analysis and interpretation of experimental data, and synthesis of information to derive valid conclusions.
5. | Modern Tool Usage | An ability to create, select and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities, with an understanding of the limitations.

Course Contents
• Introduction to Big Data Analytics
• Data Analytics
• Analytics for Business Intelligence
• Big Data Analytics
• Big Data Classifications
• Big Data for Cognitive Mobile Analytics
• Big Data Platforms
• Introduction to NoSQL databases
• Comparing NoSQL DBMS to relational DBMS
• Aggregate Data Model
• Big Data Model
• Comparing NoSQL and Relational data models
• Big Data Stores
• Processing Data on Hadoop
• Big Data Modeling Techniques
• Denormalization, Aggregates, Joins
• General Modelling Techniques (Enumerable Keys, Dimensionality Reduction, Index table)
• Big Data Analytics using Machine Learning Algorithms
• Recommendation
• Clustering
• Classification
• Linked Big Data
• Graph Computing
• Graph Analytics
• Graphical Models and Bayesian Networks
• Social and Information Networks
• Basic Network Properties and Graphs
• Random Networks
• Small-world and Scale-free Properties
• Models of Network Formation
• Big Data Visualization
• Data Exploration Components
• Techniques to Explore Data
• Tools for Exploratory Data Analysis
• Extrapolation of Data Analysis to Big Data, Techniques and Tools
• Data Visualization Techniques

Reference Books
• Thomas Erl, Wajid Khattak, Paul Buhler, Big Data Fundamentals: Concepts, Drivers and Techniques, Prentice Hall, Latest Edition
• T. Fawcett and F. Provost, Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, O’Reilly Media, Latest Edition
• Ramesh Sharda, Dursun Delen, Efraim Turban, Business Intelligence and Analytics: Systems for Decision Support, Pearson/Prentice Hall, Latest Edition
• Wes McKinney, Python for Data Analysis, O’Reilly Media, Latest Edition

Assessment
• Attendance - avoid unexcused absences
• Team / Individual Project - 20
• a semester-long team project with implementation, a report, and a few presentations (assignments and homework too!)
• Midterm Exam - 20
• Final Exam - 60
• Active participation during the lecture will be rewarded!
• actively participating during the lecture through questions or suggestions
• acting towards improving the quality of the lecture
• helping other students to better understand concepts learned in the lecture
• other positive behaviors during the lecture
• 1% towards the total mark will be awarded! (Maybe ☺)

Study Tips
• read and study lecture slides in advance
• don't be late or miss a lecture
• try to understand everything in the lecture
• ask questions during the lecture about anything that is unclear or that you do not understand
• master the concepts learned during the lecture through homework and assignments
• use Q&A time effectively

Data Science

• Data science is defined as a collection of fundamental principles that promote gaining information and knowledge from data.
• Its techniques and applications help to analyze critical data, supporting organizations in understanding their environment and in making better, timely decisions.

• Data science involves the ability to manage, analyze and act on data (“data-driven decision systems”).
• The term big data analytics is associated with data science, business intelligence and business analytics.
• “Big data analytics” refers to advanced analytic techniques applied to large and varied datasets in order to examine and extract knowledge from big data; it is one sub-process within the overall process of gaining insights from big data.

Data is characterized as the lifeblood of decision-making and
the raw material for accountability. Without high-quality
data providing the right information on the right things at
the right time, designing, monitoring and evaluating effective
policies becomes almost impossible.

Source: United Nations: A world that counts. Mobilizing the data revolution for sustainable development. United Nations, New York (2014)

Data Science and Analytics
Class 3rd and 4th

NoSQL Database
Unstructured / Semi-Structured Data

What is NoSQL?
• The Name:
• Stands for Not Only SQL
• The term NoSQL was introduced by Carlo Strozzi in 1998 as the name of his file-based database
• It was reintroduced in 2009 by Eric Evans when an event was organized to discuss open source distributed databases
• Eric states that “… but the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for. …”

Why NoSQL?
• NoSQL database systems are today an effective solution to manage large data
sets distributed over many servers.
• A primary point of interest in NoSQL systems is their support for next-generation
Web applications, for which relational DBMSs are not well suited.
• These are simple OLTP applications for which
(i) data have a structure that does not fit well in the rigid structure of relational tables,

(ii) access to data is based on simple read–write operations,

(iii) relevant quality requirements include scalability and performance, as well as a certain
level of consistency

Who is using them?

NoSQL Taxonomy

Key-value
• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of key-value pairs
• Dynamo ring partitioning and replication
• Example: (DynamoDB)
• items have one or more attributes (name, value)
• an attribute can be single-valued or multi-valued (for example, a set)
• items are combined into a table
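
The ring partitioning and replication idea listed above can be sketched roughly as follows. This is a simplified illustration and not Dynamo's actual implementation: the class name `HashRing`, the node names, the key, and the use of MD5 to place items on the ring are all assumptions made for this example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring: each key is stored on the first
    `replicas` distinct nodes found clockwise from the key's position."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sort the nodes by their position on the ring.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def nodes_for(self, key: str):
        """Return the nodes responsible for `key` (primary first)."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(self.replicas, len(self._ring))
        # Walk clockwise, wrapping around the end of the ring.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
print(owners)  # two distinct nodes; which ones depends on the hash values
```

Because a key's position depends only on its hash, adding or removing one node moves only the keys adjacent to that node on the ring, which is the main reason this scheme suits horizontally distributed stores.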

Key-value
• Basic API access:
• get(key): extract the value given a key
• put(key, value): create or update the value given its key
• delete(key): remove the key and its associated value
• execute(key, operation, parameters): invoke an operation to the value (given its
key) which is a special data structure (e.g. List, Set, Map .... etc)
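
A minimal in-memory sketch of the four operations above, for illustration only. The class `KeyValueStore` and the key `cart:7` are hypothetical; a real key-value store would add persistence, partitioning and replication.

```python
class KeyValueStore:
    """Hypothetical in-memory store illustrating the four basic operations."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return the value stored under `key`, or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Create or overwrite the value stored under `key`."""
        self._data[key] = value

    def delete(self, key):
        """Remove `key` and its value; do nothing if the key is absent."""
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        """Call the method named `operation` on the stored value
        (for example `append` on a list or `add` on a set)."""
        return getattr(self._data[key], operation)(*parameters)


store = KeyValueStore()
store.put("cart:7", ["book"])
store.execute("cart:7", "append", "pen")   # modify the stored list in place
print(store.get("cart:7"))                 # ['book', 'pen']
store.delete("cart:7")
print(store.get("cart:7"))                 # None
```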

Key-value
Pros:
• very fast
• very scalable (horizontally distributed to nodes based on key)
• simple data model
• eventual consistency
• fault-tolerance

Cons:
• Can’t model more complex data structures such as objects

Key-value
Name | Producer | Data model | Querying
SimpleDB | Amazon | set of pairs (key, {attribute}), where attribute is a pair (name, value) | restricted SQL; select, delete, GetAttributes, and PutAttributes operations
Redis | Salvatore Sanfilippo | set of pairs (key, value), where value is a simple typed value, list, ordered (according to ranking) or unordered set, or hash value | primitive operations for each value type
Dynamo | Amazon | like SimpleDB | simple get operation and put in a context
Voldemort | LinkedIn | like SimpleDB | similar to Dynamo

Document-based
• Can model more complex objects
• Inspired by Lotus Notes
• Data model: collection of documents
• Document: JSON (JavaScript Object Notation, a data model based on key-value pairs that supports objects, records, structs, lists, arrays, maps, dates and Booleans, with nesting), XML, or other semi-structured formats

Document-based
• Example: (MongoDB) document
• {Name: "Jaroslav",
Address: "Malostranske nám. 25, 118 00 Praha 1",
Grandchildren: {Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1"},
Phones: ["123-456-7890", "234-567-8963"]
}
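
As a rough illustration of working with such a document, the sketch below writes the slide's example as a Python dict and runs a simple selection over a small in-memory collection. The second document and the helper `find` are hypothetical and are not MongoDB's API; they only mimic the idea of selecting documents by field value.

```python
import json

# The slide's example document, written as a Python dict.
person = {
    "Name": "Jaroslav",
    "Address": "Malostranske nám. 25, 118 00 Praha 1",
    "Grandchildren": {"Claire": "7", "Barbara": "6", "Magda": "3",
                      "Kirsten": "1", "Otis": "3", "Richard": "1"},
    "Phones": ["123-456-7890", "234-567-8963"],
}

# A second, hypothetical document with a different set of fields:
# documents in one collection need not share a schema.
other = {"Name": "Eva", "Phones": []}

collection = [person, other]


def find(docs, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


matches = find(collection, Name="Jaroslav")
print(matches[0]["Grandchildren"]["Claire"])            # 7
print(len(json.loads(json.dumps(person))["Phones"]))    # 2
```

The round trip through `json.dumps` and `json.loads` shows that the nested structure survives serialization unchanged, which is what lets a document store keep the whole object as one unit.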

Document-based
Name | Producer | Data model | Querying
MongoDB | 10gen | object-structured documents stored in collections; each object has a primary key called ObjectId | manipulations with objects in collections (find object or objects via simple selections and logical expressions, delete, update)
Couchbase | Couchbase | document as a list of named (structured) items (JSON document) | by key and key range, views via JavaScript and MapReduce

Column-based
• One column family can have variable
numbers of columns
• Cells within a column family are sorted “physically”
• Very sparse, most cells have null values
• Comparison: RDBMS vs column-based NOSQL
• Query on multiple tables
• RDBMS: must fetch data from several places on disk and glue together
• Column-based NOSQL: only fetch column families of those columns that are
required by a query (all columns in a column family are stored together on the
disk, so multiple rows can be retrieved in one read operation → data locality)
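
The data-locality point can be sketched with a toy layout in which each column family is held separately. The family names, row keys and values below are invented for illustration, and a Python dict only stands in for what would be separate files on disk.

```python
# Hypothetical layout: each column family is stored separately, keyed by row key.
storage = {
    "profile": {                       # column family 1
        "u1": {"name": "Ann", "city": "Seoul"},
        "u2": {"name": "Bob"},         # sparse: no "city" column for u2
    },
    "activity": {                      # column family 2
        "u1": {"last_login": "2021-03-01"},
        "u2": {"last_login": "2021-03-05", "visits": "12"},
    },
}


def read_family(family, row_keys):
    """Fetch the requested rows from a single column family only."""
    rows = storage[family]
    return {key: rows.get(key, {}) for key in row_keys}


# A query that only needs names touches the "profile" family and never
# reads anything stored under "activity".
profiles = read_family("profile", ["u1", "u2"])
names = {key: cols.get("name") for key, cols in profiles.items()}
print(names)  # {'u1': 'Ann', 'u2': 'Bob'}
```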

Column-based
• Based on Google’s BigTable paper
• Like column-oriented relational databases (store data in column order) but with a twist
• Tables similar to RDBMS, but handle semi-structured data
• Data model:
• Collection of Column Families
• Column family = (key, value) where value = set of related columns (standard, super)
• indexed by row key, column key and timestamp
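
To make the three-part index concrete, here is a small sketch of a table whose cells are addressed by row key, column key and timestamp. The class `VersionedTable`, the column name `profile:address` and the sample values are assumptions for illustration; this is not the BigTable or Cassandra API.

```python
class VersionedTable:
    """Hypothetical table whose cells are addressed by
    (row key, column key, timestamp)."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key) -> {timestamp: value}

    def put(self, row_key, column_key, timestamp, value):
        """Store one version of a cell."""
        self._cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the newest value written at or before `timestamp`
        (or the newest overall when no timestamp is given)."""
        versions = self._cells.get((row_key, column_key), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = VersionedTable()
table.put("Cath", "profile:address", 1, "Busan")
table.put("Cath", "profile:address", 5, "Seoul")
print(table.get("Cath", "profile:address"))      # Seoul
print(table.get("Cath", "profile:address", 3))   # Busan
print(table.get("Cath", "profile:age"))          # None
```

Keeping older versions under their timestamps is what allows a read "as of" an earlier time, while a read without a timestamp simply returns the latest version.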

Column-based

• Example: (Cassandra column family; timestamps removed for simplicity)

UserProfile = {
Cassandra = { emailAddress: "casandra@apache.org", age: "20" }
TerryCho = { emailAddress: "terry.cho@apache.org", gender: "male" }
Cath = { emailAddress: "cath@apache.org", age: "20", gender: "female", address: "Seoul" }
}

Column-based
Name | Producer | Data model | Querying
BigTable | Google | set of pairs (key, {value}) | selection (by combination of row, column, and timestamp ranges)
HBase | Apache | groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache (originally Facebook) | columns, groups of columns corresponding to a key (supercolumns) | simple selections on key, range queries, column or column ranges
PNUTS | Yahoo | (hashed or ordered) tables, typed arrays, flexible schema | selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)

Graph-based
• Focus on modeling the structure of data (interconnectivity)
• Scales to the complexity of data
• Inspired by mathematical graph theory (G = (V, E))
• Data model:
• (Property Graph) nodes and edges
• Nodes may have properties (including ID)
• Edges may have labels or roles
• Key-value pairs on both
• Interfaces and query languages vary
• Single-step vs path expressions vs full recursion
• Examples:
• Neo4j, FlockDB, Pregel, InfoGrid …
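
The property graph model and the difference between a single-step expression and full recursion can be sketched as below. The node names, the `FOLLOWS` label and the helper functions are made up for this example and do not correspond to the query language of any of the systems listed.

```python
from collections import deque

# Nodes and edges both carry key-value properties; all names are made up.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
edges = [
    (1, 2, {"label": "FOLLOWS"}),
    (2, 3, {"label": "FOLLOWS"}),
]


def neighbours(node_id, label):
    """Single-step expression: nodes reachable over one edge with this label."""
    return [dst for src, dst, props in edges
            if src == node_id and props["label"] == label]


def reachable(start, label):
    """Full recursion: every node reachable over any number of such edges."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in neighbours(queue.popleft(), label):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print([nodes[n]["name"] for n in neighbours(1, "FOLLOWS")])        # ['Bob']
print(sorted(nodes[n]["name"] for n in reachable(1, "FOLLOWS")))   # ['Bob', 'Carol']
```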

Conclusion
• NoSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
• Problems with cloud computing:
• SaaS (Software as a Service, or on-demand software) applications require enterprise-level functionality, including ACID transactions, security, and other features associated with commercial RDBMS technology, i.e. NoSQL should not be the only option in the cloud
• Hybrid solutions:
• Voldemort with MySQL as one of its storage backends
• deal with NoSQL data as semi-structured data
→ integrating RDBMS and NoSQL via SQL/XML

Conclusion
• next generation of highly scalable and elastic RDBMS: NewSQL databases (from April 2011)
• they are designed to scale out horizontally on shared-nothing machines,
• still provide ACID guarantees,
• applications interact with the database primarily using SQL,
• the system employs a lock-free concurrency control scheme to avoid user shut down,
• the system provides higher performance than is available from traditional systems.
• Examples: MySQL Cluster (most mature solution), VoltDB, Clustrix, ScalArc, etc.

NEXT - 3 Major papers for NOSQL
• Three major papers were the “seeds” of the NoSQL movement:
• BigTable (Google)
• Dynamo (Amazon)
• Ring partition and replication
• Gossip protocol (discovery and error detection)
• Distributed key-value data stores
• Eventual consistency
• Amazon and consistency
• CAP Theorem

Q&A
