
Lecture Notes

Introduction to Data Science and Big Data


Instructor: Dr. Faisal Kamiran Teaching Assistant: M. Ahmad Ehsan

Instructions:
1- Please write an email to msds18008@itu.edu.pk in case of any queries.
2- Please do not cram the following notes, as we will not be expecting the same answer line by line.
3- You don't have to memorize the notes. They are written in detail so that you can read and understand
everything.
1- Data Science:
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract
knowledge and insights from structured and unstructured data.

2- CRISP-DM:
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data
mining efforts.

• As a methodology, it includes descriptions of the typical phases of a project, the tasks involved with each phase, and an
explanation of the relationships between these tasks.

• As a process model, CRISP-DM provides an overview of the data mining life cycle.

The life cycle model consists of six phases with arrows indicating the most important and frequent dependencies between
phases. The sequence of the phases is not strict. In fact, most projects move back and forth between phases as necessary.
The CRISP-DM model is flexible and can be customized easily. For example, if your organization aims to detect money
laundering, it is likely that you will sift through large amounts of data without a specific modeling goal. Instead of modeling,
your work will focus on data exploration and visualization to uncover suspicious patterns in financial data. CRISP-DM
allows you to create a data mining model that fits your particular needs.

In such a situation, the modeling, evaluation, and deployment phases might be less relevant than the data understanding
and preparation phases. However, it is still important to consider some of the questions raised during these later phases
for long-term planning and future data mining goals.

1- Business Understanding:

Even before working with the data, you should take the time to explore what your organization expects to gain from data
mining. Try to involve as many key people as possible in these discussions and document the results. The final
step of this CRISP-DM phase discusses how to produce a project plan using the information gathered here.

2- Data Understanding:

The data understanding phase of CRISP-DM involves taking a closer look at the data available for mining. This
means accessing the data and exploring it using tables and graphics.
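For instance, a first exploration of a tabular data set might look like the sketch below. It is only an illustration: the file name and the column names ("city", "age") are hypothetical and would be replaced by whatever your data actually contains.

import pandas as pd

# Hypothetical file; replace with your own data set.
df = pd.read_csv("customers.csv")

print(df.shape)                    # number of rows and columns
print(df.dtypes)                   # data type of each column
print(df.describe())               # summary statistics for numeric columns
print(df["city"].value_counts())   # frequency table for a categorical column

# A simple graphic: histogram of a numeric column.
df["age"].hist(bins=20)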

3- Data Preparation:

Data preparation is one of the most important and often time-consuming aspects of data mining. In fact, it is
estimated that data preparation usually takes 50-70% of a project's time.
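As a rough illustration of why this phase takes so long, the sketch below shows a few typical preparation steps in pandas. The file and column names are hypothetical, and the exact steps always depend on the data at hand.

import pandas as pd

df = pd.read_csv("customers.csv")   # hypothetical file

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fill missing numeric values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Drop rows where the target label is missing.
df = df.dropna(subset=["churned"])

# Convert a text date column to a proper datetime type.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Encode a categorical column as one-hot indicator columns.
df = pd.get_dummies(df, columns=["city"])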

4- Data Modeling:

This is the point at which your hard work begins to pay off. The data you spent time preparing are brought into
the analysis, and the results begin to shed some light on the business problem posed during Business
Understanding.

5- Evaluation:

At this point, you've completed most of your data mining project. You've also determined, in the Modeling phase,
that the models built are technically correct and effective.
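One common technical check at this stage is to measure how well a model generalizes to data it has not seen, for example with cross-validation. A minimal sketch with scikit-learn, using a placeholder data set and model:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)              # placeholder data set
model = LogisticRegression(max_iter=1000)      # placeholder model

# 5-fold cross-validation: train on 4 folds, evaluate on the 5th, repeat.
scores = cross_val_score(model, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())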

6- Deployment:

Deployment is the process of using your new insights to make improvements within your organization.
Alternatively, deployment can mean that you use the insights gained from data mining to elicit change in your
organization. For example, perhaps you discovered alarming patterns in your data indicating a shift in behavior
for customers over the age of 30. These results may not be formally integrated into your information systems, but
they will undoubtedly be useful for planning and making marketing decisions.

3- Data Science Techniques:

Classification:
Classification is a classic data mining technique based on machine learning. It is used to assign each item in a
data set to one of a predefined set of classes or groups.
Remember the example discussed in class?
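As a small illustration (not the exact example from class), the sketch below trains a decision tree on the well-known Iris data set and uses it to classify flowers it has not seen before:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labeled data set: flower measurements and their species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train the classifier on the labeled training data.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Classify the held-out items and check how often the predictions are correct.
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))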
Clustering:
Clustering is a data mining technique that automatically groups objects with similar characteristics into
meaningful or useful clusters. To make the concept clearer, consider book management in a library. A library
holds a wide range of books on various topics, and the challenge is to arrange them so that readers can find
several books on a particular topic without hassle. Using clustering, we can keep books that share some kind of
similarity on one shelf (one cluster) and label it with a meaningful name. Readers interested in that topic then
only have to go to that shelf instead of searching the entire library.
Remember the example discussed in class?
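Continuing the library analogy, the sketch below clusters short book titles by the words they contain, using TF-IDF features and k-means. The titles and the number of "shelves" (clusters) are made up purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical book titles; in practice these would come from the catalogue.
titles = [
    "Introduction to Algebra", "Linear Algebra Done Right",
    "Organic Chemistry Basics", "Advanced Organic Chemistry",
    "World History: Ancient Rome", "A Short History of Europe",
]

# Turn each title into a numeric vector based on its words.
vectors = TfidfVectorizer().fit_transform(titles)

# Group the titles into 3 "shelves" (clusters) by similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

for title, label in zip(titles, labels):
    print(label, title)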

Pattern Mining:

Pattern mining is a data mining technique that seeks to discover or identify similar patterns, regular events, or
trends in transactional data over a business period.

In sales, for example, historical transaction data lets a business identify sets of items that customers buy together
at different times of the year. The business can then use this information to recommend those items together,
with better deals, based on customers' past purchasing frequency.

Remember the example discussed in class?
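A minimal sketch of the idea, counting which pairs of items appear together most often across a list of made-up transactions, without any external library:

from collections import Counter
from itertools import combinations

# Hypothetical transactions: each inner list is one customer's basket.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs", "butter"],
    ["bread", "milk", "butter"],
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidates for joint promotions.
for pair, count in pair_counts.most_common(3):
    print(pair, count)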


4- Tools that are used for Data Science
• Weka
• RapidMiner
• KNIME
• Tableau
• R
• Python

5- Big Data 4 V’s


1- Volume
The main characteristic that makes data “big” is the sheer volume. It makes no sense to focus on minimum
storage units because the total amount of information is growing exponentially every year. In 2010, Thomson
Reuters estimated in its annual report that it believed the world was “awash with over 800 exabytes of data and
growing.”

2- Variety
Variety is one of the most interesting developments in technology as more and more information is digitized.
Traditional data types (structured data) include things on a bank statement like date, amount, and time. These
are things that fit neatly in a relational database.

Structured data is augmented by unstructured data, which is where things like Twitter feeds, audio files, MRI
images, web pages, web logs are put — anything that can be captured and stored but doesn’t have a meta
model (a set of rules to frame a concept or idea — it defines a class of information and how to express it) that
neatly defines it.

Unstructured data is a fundamental concept in big data. The best way to understand unstructured data is by
comparing it to structured data. Think of structured data as data that is well defined by a set of rules. For
example, money will always be numeric and have two decimal places; names are expressed as text; and
dates follow a specific pattern.

With unstructured data, on the other hand, there are no rules. A picture, a voice recording, a tweet — they all
can be different but express ideas and thoughts based on human understanding. One of the goals of big data is
to use technology to take this unstructured data and make sense of it.

3- Veracity
Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is
representative? Every good manager knows that there are inherent discrepancies in all the data collected.

4- Velocity
Velocity is the frequency of incoming data that needs to be processed. Think about how many SMS messages,
Facebook status updates, or credit card swipes are being sent on a particular telecom carrier every minute of
every day, and you’ll have a good appreciation of velocity. A streaming service such as Amazon Web Services
Kinesis is one example of a technology designed to handle this velocity of data.

6- Trick for Big Data Processing

Instead of collecting all the data from all devices and then processing it in one place, you process the data where
it is located and collect only the results. This is called bringing computation to the data.
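A toy sketch of the idea: each "device" computes a small local summary, and only those summaries are collected and combined. The data and devices are simulated in memory purely for illustration.

# Simulated data stored on three separate devices (e.g. sensors or servers).
device_data = {
    "device_a": [3, 7, 2, 9],
    "device_b": [5, 5, 6],
    "device_c": [1, 8, 4, 4, 2],
}

def local_summary(values):
    """Runs on the device itself: returns a small result, not the raw data."""
    return {"count": len(values), "total": sum(values)}

# Only the summaries travel over the network, not the full data sets.
summaries = [local_summary(v) for v in device_data.values()]

overall_count = sum(s["count"] for s in summaries)
overall_total = sum(s["total"] for s in summaries)
print("Global average:", overall_total / overall_count)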

7- Big Data Platforms

Hadoop:
Hadoop can be thought of as a set of open source programs and procedures (meaning essentially they are free
for anyone to use or modify, with a few exceptions) which anyone can use as the "backbone" of their big data
operations.

It has two main components.

Hadoop Distributed File-System (HDFS):

Hadoop Distributed File System allows data to be stored in an easily accessible format, across a large number of
linked storage devices.

Map Reduce:

MapReduce is named after the two basic operations this module carries out: reading data from the database and
putting it into a format suitable for analysis (map), and performing mathematical operations on it, e.g. counting
the number of males aged 30+ in a customer database (reduce).
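The sketch below imitates the map and reduce steps in plain Python for the example above, counting male customers aged 30 or over. The customer records are made up, and a real Hadoop job would distribute these steps across many machines rather than run them in one process.

from functools import reduce

# Hypothetical customer records read from a database.
customers = [
    {"name": "Ali", "gender": "male", "age": 34},
    {"name": "Sara", "gender": "female", "age": 29},
    {"name": "Bilal", "gender": "male", "age": 41},
    {"name": "Omar", "gender": "male", "age": 25},
]

# Map: turn each matching record into a (key, value) pair suitable for counting.
mapped = [("male_30_plus", 1)
          for c in customers
          if c["gender"] == "male" and c["age"] >= 30]

# Reduce: combine all values for the key into a single count.
count = reduce(lambda total, pair: total + pair[1], mapped, 0)
print("Males aged 30+:", count)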
