Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

Chapter 1: Introduction to data science

Introduction to
Data Science
What is Data Science?
What is Data Science Methodology?

Course Instructor: Anam Shahid


Chapter 1: Introduction to data science

Introduction to recommended books, and online courses


Disclaimer: The contents are adopted from various books, and two massive open online
courses, listed below:
Recommended Text Books:

 Getting Started with Data Science. Making Sense of Data with Analytics by Murtaza Haider.
IBM Press: Pearson plc
 The data science handbook by Field Cady. Wiley 2017.

Figure 1. Screen Shots of Recommended Books.

Recommended Datacamp Skill Track:


Data Literacy Fundamentals https://app.datacamp.com/learn/skill-tracks/data-literacy-fundamentals

 DS4E: Data Science for Everyone https://app.datacamp.com/learn/courses/data-science-for-


everyone

Recommended Massive Open Online Courses (MOOCs)


 What is Data Science? https://www.coursera.org/learn/what-is-datascience
 Data Science Methodology. https://www.coursera.org/learn/data-science-
methodology?specialization=introduction-data-science
 Steps to Enroll in online course # 1: What is Data Science? You can apply for financial aid for free
certificate. Visit https://www.coursera.org/learn/what-is-datascience and press Enroll for free  Next 
Audit
Chapter 1: Introduction to data science

Figure 2. Steps to enroll into MOOC # 1

 Steps to Enroll in online course #2: Data Science Methodology? Steps to Enroll:
 Visit https://www.coursera.org/learn/data-science-methodology?specialization=introduction-data-science
Chapter 1: Introduction to data science

Figure 3. Steps to enroll into MOOC # 2


Chapter 1: Introduction to data science

In this course, we'll unpack the what, why, and how of data science. By the end of this course, you'll have
a better grasp of how data is being used around you and how you can use data.

What is data science?

 Data science is the field of exploring, manipulating, and analysing data, and using data to answer
questions or make recommendations.

1. Fundamentals of Data Science

Everyone you ask will give you a slightly different description of what Data Science is, but most people
agree that it has a significant data analysis component.

Data analysis isn't new. What is new is:


 The vast quantity of data available from massively varied sources: from log files, email, social
media, sales data, patient information files, sports performance data, sensor data, security
cameras, and many more besides.
 At the same time that there is more data available than ever, we have the computing power
needed to make a useful analysis and reveal new knowledge.
Data science can help organizations understand their environments, analyze existing issues, and reveal
previously hidden opportunities.
Data scientists use data analysis to add to the knowledge of the organization by investigating data,
exploring the best way to use it to provide value to the business.

2. What is the process of data science?


 Many organizations will use data science to focus on a specific problem, and so it's essential to
clarify the question that the organization wants answered. This first and most crucial step defines
how the data science project progresses. Good data scientists are curious people who ask
questions to clarify the business need.
 The next questions are: "what data do we need to solve the problem, and where will that data
come from?".
Data scientists can:

 Analyze structured and unstructured data from many sources, and depending on the nature of the
problem, they can choose to analyze the data in different ways.
 Using multiple models to explore the data reveals patterns and outliers; sometimes, this will
confirm what the organization suspects, but sometimes it will be completely new knowledge,
leading the organization to a new approach.
 When the data has revealed its insights, the role of the data scientist becomes that of a storyteller,
communicating the results to the project stakeholders.
Chapter 1: Introduction to data science

 Data scientists can use powerful data visualization tools to help stakeholders understand the
nature of the results, and the recommended action to take.
Data Science is changing

 the way we work;


 the way we use data,
 the way organizations understand the world.
 Our approach to the world

3. Advice for New Data Scientists


Contemporary data scientists come from different backgrounds such as engineering, mathematics, and
even psychology. The secret skill is passion for continuous learning of new tools and patience to clean
and analyze data. Data scientist is to be curious, extremely argumentative and judgmental.

 Curiosity is absolute must. If you're not curious, you would not know what to do with the data.
So, curiosity being able to take a position, strong position, and then moving forward with it.
 Judgmental because if you do not have preconceived notions about things you wouldn't know
where to begin with.
 Argumentative because if you can argument and if you can plead a case, at least you can start
somewhere and then you learn from data and then you modify your assumptions and hypotheses
and your data would help you learn. And you might start at the wrong point. You may say that I
thought I believed this, but now with data I know this. So, this allows you a learning process.
The other thing that the data scientist would need is some comfort and flexibility with analytics platforms:
some software, some computing platform, but that's secondary. The most important thing is curiosity and
the ability to take positions. Once you have done that, once you've analyzed, then you've got some
answers.
And that's the last thing that a data scientist need, and that is the ability to tell a story. That once you
have your analytics, once you have your tabulations, now you should be able to tell a great story from it.
Because if you don't tell a great story from it, your findings will remain hidden, remain buried, nobody
would know. But your rise to prominence is pretty much relying on your ability to tell great stories. A
starting point would be to see “what your competitive advantage is”.

4. Do you want to be a data scientist in any field or a specific field?

Let's say you want to be a data scientist and work for an IT firm or a web-based or Internet based firm,
then you need a different set of skills.

 And if you want to be a data scientist in the health industry, then you need different sets of skills.
So, figure out first what you're interested, and what is your competitive advantage. Your
competitive advantage is not necessarily going to be your analytical skills. Your competitive
advantage is your understanding of some aspect of life where you exceed beyond others in
understanding that. Maybe it's film, maybe it's retail, maybe it's health, maybe it's computers.
 Once you've figured out where your expertise lies, then you start acquiring analytical skills.
What platforms to learn and those platforms, those tools would be specific to the industry that
you're interested in.
Chapter 1: Introduction to data science

 And then once you have got some proficiency in the tools, the next thing would be to apply your
skills to real problems, and then tell the rest of the world what you can do with it.

Quiz:
The three important qualities to possess in order to succeed as a data scientist are:

Curious.

Judgmental.

Argumentative.

Proficient in Programming.

Good at Math and Statistics.


Chapter 1: Introduction to data science

5. Data Science Job

Summary:

 Data science is the study of large quantities of data, which can reveal insights that help organizations make
strategic choices.
 There are many paths to a career in data science; most, but not all, involve a little math, a little science, and
a lot of curiosity about data.
 New data scientists need to be curious, judgemental and argumentative.
 Why data science is considered the sexiest job in the 20th century, paying high salaries for skilled workers.
Chapter 1: Introduction to data science

6. Making data work for you


But data science is actually simple. It's a set of methodologies for taking in thousands of forms of data
that are available to us today, and using them to draw meaningful conclusions. Data is being collected all
around us. Every like, click, email, credit card swipe, or tweet is a new piece of data that can be used to
better describe the present or better predict the future.

What can data do?


So what can data do? Data can describe our current state, like our energy consumption. This can be
accomplished with dashboards or alerts, simplifying time-intensive reporting processes. It can help detect
anomalous events, such as fraudulent purchases. If we have data on what has happened previously, we
can increase efficiency by automatically detecting a new event that is unexpected or abnormal. Data can
also diagnose the causes of observed events and behaviors, for instance your activity on Spotify or
Netflix. Rather than determining correlations between small numbers of events, data science techniques
help us understand complex systems with many possible causes. Finally, data can predict future events,
such as forecasting population size. We can use new techniques to take various causes into account and
predict potential outcomes. Further, we can evaluate the probability of our prediction mathematically to
clarify our level of uncertainty.

Why now?
So now we know what data science is. The next question is why is it so popular? The answer is pretty
obvious: we're collecting more data than ever before. Suppose that you visit a car dealership and fill out
some information. All of that data is automatically entered into a computer, and combined with the data
from hundreds of dealerships into one big database.

Once we have that data, it's easy to use the email address that you provided when you bought that car to
tie your car purchase data with your data from social media or web browsing. Suddenly, we have a very
complete picture about everyone who purchased a car in the last year: their ages, their likes and dislikes,
their friends and family. This additional data can be used to predict what price you can pay for your car,
what other purchases you're likely to make, or how best to sell you insurance for that new car. Data is
everywhere, and it is incredibly valuable information for businesses, organizations, and governments.

7. The data science workflow


So, how do we start to use data? In data science, we generally have four steps to any project. First, we
collect data from many sources, such as surveys, web traffic results, geo-tagged social media posts, and
financial transactions. Once collected, we store that data in a safe and accessible way.

At this point, data is in its raw form, so the next step is to prepare data. This includes "cleaning data", for
instance finding missing or duplicate values, and converting data into a more organized format.

Then, we explore and visualize the cleaned data. This could involve building dashboards to track how the
data changes over time or performing comparisons between two sets of data.

Finally, we run experiments and predictions on the data. For example, this could involve building a
system that forecasts temperature changes or performing a test to find which web page acquires more
customers.
Chapter 1: Introduction to data science

8. Applications of data science


Previously, you learned the definition of data science and the steps in a data science workflow. In this
lesson, you'll learn how data science can be applied to real-world problems.

More case studies


Let's take a deep dive into three exciting areas of data science:

 Traditional machine learning,


 Internet of Things, and
 Deep learning.

1. Case study: fraud detection


Suppose you work in fraud detection at a large bank. You'd like to use data to determine the probability
that the transaction is fake.

To answer this question, you might start by gathering information about each purchase, such as the
amount, date, location, purchase type, and card-holder's address. You'll need many examples of
transactions, including this information, as well as a label that tells you whether each transaction is valid
or fraudulent. Luckily, you probably have this information in a database. These records are called
"training data", and are used to build an algorithm. Each time a new transaction occurs, you'll give your
algorithm information, like amount and date, and it will answer the original question: What is the
probability that this transaction is fraudulent?
Chapter 1: Introduction to data science

What do we need for machine learning?


Before we can answer that question, let's walk through our example and highlight what we need for
machine learning to work its magic. First, a data science problem begins with a well-defined question.
Our question was "What is the probability that this transaction is fraudulent?" Next, we need some data to
analyze. We have months of old credit card transactions and associated metadata, like date and location,
that have already been identified as either fraudulent or valid. Finally, we need additional data every time
we want to make a new prediction. We need to have the same type of information on every new purchase
so that we could label it as "fraudulent" or "valid".

2. Case study: smart watch


Now, suppose you're trying to build a smart watch to monitor physical activity. You want to be able to
auto-detect different activities, such as walking or running. Your smart watch is equipped with a special
sensor, called an "accelerometer” that monitors motion in three dimensions. The data generated by this
sensor is the basis of your machine learning problem. You could ask several volunteers to wear your
watch and record when they are running or walking. You could then develop an algorithm that recognizes
accelerometer data as representing one of those two states: walking or running.

Internet of Things (IoT)


Your smart watch is part of a fast growing field called "the Internet of Things", also known as IoT, which
is often combined with Data Science. IoT refers to gadgets that are not standard computers, but still have
the ability to transmit data. This includes smart watches, internet-connected home security systems,
electronic toll collection systems, building energy management systems, and much, much more. IoT
data is a great resource for data science projects!

3. Case study: image recognition


Let's tackle another example. A key task for self-driving cars is identifying when an image contains a
human. What would the dataset be for this problem?
Chapter 1: Introduction to data science

We could express the picture as a matrix of numbers where each number represents a pixel. However, this
approach would probably fail if we fed the matrix into a traditional machine learning model. There's
simply too much input data!

Deep learning
We need more advanced algorithms from a subfield of machine learning called deep learning. In deep
learning, multiple layers of mini-algorithms, called "neurons", work together to draw complex
conclusions. Deep learning takes much, much more training data than a traditional machine learning
model, but is also able to learn relationships that traditional models cannot. Deep learning is used to solve
data-intensive problems, such as image classification or language understanding.

9. Data science roles and tools


In this lesson, you'll learn about the different data roles and the tools they use.

Roles in data science


You might be surprised to learn that there isn't a single job within data science. Generally, there's four
jobs: Data Engineer, Data Analyst, Data Scientist, and Machine Learning Scientist. Let's explore
each one.
Chapter 1: Introduction to data science

1. Data engineer
Data engineers control the flow of data: they build custom data pipelines and storage systems. They
design infrastructure so that data is not only collected, but easy to obtain and process. Within the data
science workflow, they focus on the first stage: data collection and storage.

Data engineering tools


Data engineers are proficient in SQL, which they use to store and organize data. They also use one of the
following programming languages like Java, Scala, or Python to process data. They use Shell on the
command line to automate and run tasks. Finally, data engineers, now more than ever, need to be
comfortable with cloud computing to ingest and store large amounts of data.

2 Data analyst
Data analysts describe the present via data. They do this by exploring the data and creating visualizations
and dashboards. To do these tasks, they often have to clean data first. Analysts have less programming
and stats experience than the other roles. Within the workflow, they focus on the middle two stages: data
preparation and exploration and visualization.

Data analyst tools


Data analysts use SQL, the same language used by data engineers, to query data. While data engineers
build and configure SQL storage solutions, analysts use existing databases to retrieve and aggregate data
relevant to their analysis. Data analysts use spreadsheets to perform simple analyses on small quantities of
data. Analysts also use Business Intelligence, or BI Tools, such as Tableau, Power BI, or Looker, to
create dashboards and share their analyses. More advanced data analysts may be comfortable with Python
or R for cleaning and analyzing data.

3. Data scientist
Data Scientists have a strong background in statistics, enabling them to find new insights from data,
rather than solely describing data. They also use traditional machine learning for prediction and
forecasting. Within the workflow, they focus on the last three stages: data preparation and exploration
and visualization, and experimentation and prediction.

Data scientist tools


Similar to analysts, data scientists have strong skills in SQL. Data scientists must be proficient in at least
Python on R. Within these languages, they use popular data science libraries, such as pandas or
tidyverse. These libraries contain reusable code for common data science tasks.

4. Machine learning scientist


Machine learning scientists are similar to data scientists, but with a machine learning specialization.
Machine learning is perhaps the buzziest part of Data Science; it's used to extrapolate what's likely to be
true from what we already know. These scientists use training data to classify larger, unrulier data,
whether its to classify images that contain a car, or create a chatbot. They go beyond traditional machine
learning with deep learning. Within the workflow, they do the last three stages with a strong focus on
prediction.
Chapter 1: Introduction to data science

Machine learning tools


Machine learning scientists use either Python or R to create their predictive models. Within these
languages, they use popular machine learning libraries, such as TensorFlow, to run powerful deep
learning algorithms.

Review:

Here's a recap what we've covered. It may be intimidating to see all these tools and languages, but they
aren't as difficult to learn as spoken languages. If you know English, it may take you years to learn
French. Programming languages are more similar to power tools. If you know how to use a power drill,
you don't necessarily know how to use an electric saw, but you can learn with a little training!
Chapter 1: Introduction to data science

You might also like