Module_1


DATA SCIENCE AND VISUALIZATION

MODULE 1
INTRODUCTION TO DATA SCIENCE
Exploring the Fascinating World of Data Science
In a world awash with information, data science has emerged as a
powerful discipline that harnesses the insights hidden within vast troves of
data. From uncovering patterns in customer behavior to predicting global
trends, this field promises to transform how we understand and interact
with the world around us.
Understanding Big Data and Data Science
When it comes to big data and data science, there's a lot of hype. However, it's
important for college students to know that not everything they hear is
true. This presentation aims to explain what big data and data science
really are. We will explore the challenges, opportunities, and real-world
uses of these fields, helping students make informed decisions as they
explore the world of data.
Big Data and Data Science Hype-
Why is data science actually hyped? So, what is eyebrow-raising about Big Data and data science?
Let’s count the ways:
• There’s a lack of definitions around the most basic terminology. What is “Big Data” anyway? What does “data
science” mean? What is the relationship between Big Data and data science? Is data science the science of Big
Data? Is data science only the stuff going on in companies like Google and Facebook and tech companies?
Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) and to data
science as only taking place in tech? Just how big is big? Or is it just a relative term? These terms are so
ambiguous, they’re well-nigh meaningless.
• There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on
this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by
statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the
media describes it, machine learning algorithms were just invented last week and data was never “big” until
Google came along. This is simply not the case. Many of the methods and techniques we’re using—and the
challenges we’re facing now—are part of the evolution of everything that’s come before. This doesn’t mean
that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
Understanding Big Data and Data Science
What is "Big Data"?
The term "Big Data" is often used to describe large, complex datasets that are difficult to process using traditional methods. However, exactly how large counts as "big" varies and can be subjective.
What is "Data Science"?
"Data Science" is a broad term that encompasses various activities, including statistical analysis and machine learning. It is not always clear how it relates to other fields like statistics and computer science.
The Connection
While big data and data science are related, their relationship is not well-defined. It is unclear whether data science is solely focused on analyzing big data or includes other data-driven activities. Clarifying these definitions is important for establishing a more coherent and respected field.
Respecting the Past
1 Honoring Contributions
Researchers in statistics, computer science, mathematics, and other fields have laid the foundation for big data and data science. It's important to recognize and respect their contributions.
2 Avoiding Confusion
Data science isn't just a new name for existing fields. While it shares some similarities with statistics and machine learning, it has its own unique qualities.
3 Combining Skills
Data science combines knowledge from various fields, including math, computer science, and other areas. This makes it stronger.
4 Staying Grounded
It's important not to get too excited about data science and big data. Taking a practical approach helps us understand what they can and cannot do.
What is Data Science?
1 Getting Data
Data science starts by collecting and organizing relevant data, whether it's from traditional sources or new big data platforms. This involves understanding data structures, formats, and working with large and complex datasets.
2 Exploring and Analyzing Data
Data scientists use statistical, mathematical, and computational techniques to explore, understand, and find insights in the data. They apply methods like machine learning and predictive modeling.
3 Interpreting and Communicating
The final step in data science is interpreting the findings, effectively communicating insights, and turning them into actionable recommendations or solutions. This requires good communication skills and bridging the gap between technical analysis and business needs.
Creating an Effective Data Science Field
1 Collaboration
Working together with experts from different areas is important for solving real-world problems and making a positive impact with data-driven insights.
2 Ethics
Following strong ethical guidelines is crucial to address concerns about privacy, bias, and responsible use of data and algorithms. This builds trust and credibility.
3 Continuous Learning
As the field evolves quickly, ongoing learning and keeping up with the latest tools and techniques are essential for remaining relevant and effective as a data scientist.
4 Interdisciplinary Approach
By combining knowledge from different fields like statistics, computer science, and domain expertise, data science can better tackle complex, real-world problems.
Why Now?
• Right now, we have more information than ever before because of all the data we collect and the powerful computers we have.
• This data tells us a lot about how people behave and how society works.
• With new technology, we can use this data to learn new things and come up with new ideas in many different industries.
• There's a huge amount of data out there, from what we do online to how we move in the real world, and it gives us a clear picture of our lives.
• At the same time, computers are getting cheaper and more powerful, so we can process and analyze all this data on a large scale.
• This perfect combination of lots of data and advanced technology is creating a new way of making decisions based on data and making data science a really important field.
Datafication: Turning Life into Data
1 Intentional Datafication
• When we actively participate in social media, online shopping, or other digital platforms, we are intentionally sharing our data and allowing it to be collected.
• This type of datafication is often seen as a fair exchange, where we willingly trade personal information for the convenience and benefits of these services.
2 Passive Datafication
• However, the datafication of our lives extends beyond our conscious choices. Our offline behaviors, such as walking through a store or using a fitness tracker, are also being captured and turned into data, often without our explicit knowledge or consent.
3 Transforming Data into Value
Regardless of the level of intentionality, the datafication of our lives has enabled the transformation of information into new forms of value. Businesses, governments, and other organizations are leveraging this data to drive decision-making, personalize products and services, and uncover insights that were previously inaccessible.
The Evolving Landscape of Data Science
Rebranding or Revolution?
There is an ongoing debate about whether data science is a genuine new field or simply a rebranding of existing disciplines like statistics and analytics. Some argue that data science is merely a collection of well-established techniques, while others see it as a transformative approach that combines diverse skills and ...
Industry vs. Academia
The growth of data science has been driven primarily by industry, with companies like Google, Facebook, and LinkedIn pioneering the field. In contrast, the academic world has been slower to embrace data science, with few dedicated programs or professorships. This disconnect highlights the practical, real-world focus of data science compared to ...
Emerging Specialties
As data science continues to evolve, new specialties and sub-disciplines are emerging, such as machine learning, natural language processing, and data visualization. These specialized skills are in high demand, reflecting the diverse and multifaceted nature of the field.
The Data Scientist: A Rising Role
1 Unique Skill Set
Data scientists have a special mix of skills. They know statistics, computer science, and have knowledge in a specific field. This helps them solve complex problems that need both technical and analytical abilities.
2 Curiosity and Persistence
Successful data scientists are curious. They want to find hidden insights in data. They never give up easily, even when working with messy and unorganized data. They search for valuable information that can bring meaningful change.
3 Collaborative Mindset
Data science is a team effort. Data scientists work with stakeholders, experts in different fields, and other data professionals. They should be good at communicating their findings and turning technical insights into practical business ...
4 An Emerging Role
The term "data scientist" appeared in the late 2000s. It describes the unique skills needed to make sense of the growing amount of data. This role has gained recognition and prestige. In fact, Harvard Business Review called data ...
The Data Science Toolkit
Programming
Proficiency in programming languages like Python, R, and SQL is essential for data scientists to extract, manipulate, and analyze data.
Statistics
A strong foundation in statistical methods and modeling techniques is crucial for drawing meaningful insights from data.
Machine Learning
The ability to apply machine learning algorithms and techniques to identify patterns, make predictions, and automate decision-making.
Data Visualization
Effective data visualization skills are needed to communicate complex findings in a clear and compelling way.
In the class, Rachel handed out index cards and asked everyone to profile themselves (on a relative rather than
absolute scale) with respect to their skill levels in the following domains:
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization
Data Science in Finance: Credit Ratings and Trading
Credit Ratings
Data-driven models analyze an individual's credit history, income, and other financial information to determine their creditworthiness and assign a credit score.
Trading Algorithms
Sophisticated algorithms analyze market data, news, and other information in real-time to identify trading opportunities and execute transactions automatically.
Conclusion of Today’s Class
• Introduction to Data Science
• What is Data Science?
• Big Data and Data Science hype
• And getting past the hype
• Why now?
• Datafication
• Current landscape of perspectives
• Skill sets.
Data Science
The world we live in is complex, random, and uncertain. At the same time,
it's one big data-generating machine. As we go about our daily lives, we
constantly produce data that can be captured and analyzed to gain
insights about the world around us. This process of turning the real world
into data and then using statistical inference to understand the underlying
processes is the foundation of data science.
Needed Statistical Inference-

• The world we live in is complex, random, and uncertain. At the same time, it’s one big data-generating
machine.
• As we commute to work on subways and in cars, as our blood moves through our bodies, as we’re shopping,
emailing, procrastinating at work by browsing the Internet and watching the stock market, as we’re building
things, eating things, talking to our friends and family about things, while factories are producing products,
this all at least potentially produces data.
• Imagine spending 24 hours looking out the window, and for every minute, counting and recording the number
of people who pass by. Or gathering up everyone who lives within a mile of your house and making them tell
you how many email messages they receive every day for the next year.
• Imagine heading over to your local hospital and rummaging around in the blood samples looking for patterns
in the DNA. That all sounded creepy, but it wasn’t supposed to. The point here is that the processes in our
lives are actually data-generating processes.
• We’d like ways to describe, understand, and make sense of these processes, in part because as scientists we
just want to understand the world better, but many times, understanding these processes is part of the solution
to problems we’re trying to solve.
• Data represents the traces of the real-world processes, and exactly which traces we gather are decided by our
data collection or sampling method. You, the data scientist, the observer, are turning the world into data, and
this is an utterly subjective, not objective, process.
• After separating the process from the data collection, we can see clearly that there are two sources of
randomness and uncertainty. Namely, the randomness and uncertainty underlying the process itself, and the
uncertainty associated with your underlying data collection methods.
• Once you have all this data, you have somehow captured the world, or certain traces of the world. But you
can’t go walking around with a huge Excel spreadsheet or database of millions of transactions and look at it
and, with a snap of a finger, understand the world and process that generated it.

• “This overall process of going from the world to the data, and then from the data back to the world, is the
field of statistical inference.”
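The two sources of uncertainty described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a method from the class: the process (daily email counts drawn from a normal distribution with mean 40) and the collection scheme (reports on roughly one day in seven, rounded to the nearest 10) are hypothetical choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Source 1: the process itself is random.
# Hypothetical process: emails one person receives per day, about 40 on average.
true_daily_emails = [random.gauss(40, 8) for _ in range(365)]

# Source 2: the data collection method adds its own uncertainty.
# Hypothetical method: a count is reported on roughly 1 day in 7,
# and each reported count is rounded to the nearest 10.
observed = [round(x / 10) * 10 for x in true_daily_emails
            if random.random() < 1 / 7]

process_mean = statistics.mean(true_daily_emails)
observed_mean = statistics.mean(observed)

print(f"days in process: {len(true_daily_emails)}, days observed: {len(observed)}")
print(f"mean of process: {process_mean:.1f}, mean of observed data: {observed_mean:.1f}")
```

The observed list is what a data scientist would actually hold: fewer values than the process produced, and coarser ones. Any conclusion about the process has to allow for both layers of randomness.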
Needed Statistical Inference
Data Collection
The processes in our lives are data-generating. We gather traces of these processes through data collection and sampling methods, which are subjective and introduce uncertainty.
Statistical Inference
The overall process of going from the real world to data and then back to understanding the world is the field of statistical inference, which allows us to draw conclusions about the processes that generated the data.
Statistical Modeling
To make sense of the data, we create statistical models that represent our understanding of the underlying processes. These models use parameters to capture the relationships in the data.
Populations and Samples
Population
In statistical inference, the population refers to the entire set of objects or units of interest, such as all emails sent by employees at a company. The population size is denoted as N.
Sample
A sample is a subset of the population that is observed or measured. Samples are used to estimate the characteristics of the larger population, as it is often impractical or impossible to measure the entire population.
Sampling
The process of selecting a sample from the population is called sampling. The way the sample is chosen can introduce additional uncertainty and bias into the data, which must be accounted for in the statistical analysis.
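To make the population and sample distinction concrete, here is a minimal sketch. The population is simulated, and every number in it (N = 10,000 employees, email counts between 0 and 500, a sample of n = 200) is a hypothetical value chosen for illustration. It contrasts a simple random sample with a deliberately biased one.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: emails sent last year by each of N employees.
N = 10_000
population = [random.randint(0, 500) for _ in range(N)]

# Simple random sample of n units, drawn without replacement.
n = 200
sample = random.sample(population, n)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# A biased sampling scheme for contrast: only heavy emailers are observed.
biased_sample = [x for x in population if x > 400][:n]
biased_mean = statistics.mean(biased_sample)

print(f"population mean: {population_mean:.1f}")
print(f"random sample mean: {sample_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

The random sample's mean lands near the population mean, while the biased sample overstates it. The way the sample is chosen shapes what the data can say about the population.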
What is a model?
• Humans try to understand the world around them by representing it in different ways. Architects capture
attributes of buildings through blueprints and three-dimensional, scaled-down versions.
• A model is our attempt to understand and represent the nature of reality through a particular lens, be it
architectural, biological, or mathematical.

Statistical modeling-
• Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the
underlying process might be with your model. What comes first? What influences what? What causes what?
What’s a test of that?
• But different people think in different ways. Some prefer to express these kinds of relationships in terms of
math. The mathematical expressions will be general enough that they have to include parameters, but the
values of these parameters are not yet known.
What is a Model?
1 Representing Reality
A model is an attempt to understand and represent the nature of reality through a particular lens, such as architectural, biological, or mathematical.
2 Expressing Relationships
Models can be expressed using mathematical equations or diagrams that capture the relationships between variables and parameters.
3 Estimating Parameters
The parameters in a model are unknown values that need to be estimated using the observed data. This process of fitting the model is a key step in statistical inference.
Probability Distributions-

• Probability distributions are the foundation of statistical models.
• When we get to linear regression and Naive Bayes, you will see
how this happens in practice.
• Back in the day, before computers, scientists observed real-
world phenomena, took measurements, and noticed that
certain mathematical shapes kept reappearing. The classical
example is the height of humans, which follows a normal
distribution (a bell-shaped curve, also called a Gaussian
distribution, named after Gauss).
• Other common shapes have been named after their observers as
well (e.g., the Poisson distribution and the Weibull
distribution), while other shapes such as Gamma distributions
or exponential distributions are named after associated
mathematical objects.
• Not all processes generate data that looks like a named
distribution, but many do. We can use these functions as
building blocks of our models. The common shapes only have
names because someone observed them enough times to think
they deserved names. There is actually an infinite number of
possible distributions.
Probability Distributions
Foundational Concept
Probability distributions are the foundation of statistical models, as they provide a way to assign probabilities to different outcomes or events.
Named Distributions
Certain mathematical shapes, such as the normal distribution, Poisson distribution, and Weibull distribution, have been observed to appear in many real-world phenomena and have been given names.
Functional Form
Probability distributions have a specific functional form that includes parameters, which can be estimated from the observed data to model the underlying process.
Infinite Possibilities
While many real-world processes can be modeled using named distributions, there are an infinite number of possible probability distributions that can be used to represent different types of data and processes.
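As a rough illustration of named distributions and their parameters, the sketch below draws samples from four of them using Python's standard `random` module and compares each sample mean with the mean implied by the parameters. The parameter values are arbitrary assumptions, not estimates from real data. The Poisson distribution is left out only because the standard library has no sampler for it.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable
size = 20_000

# Each named distribution has its own functional form and parameters.
draws = {
    "normal(mu=170, sigma=10)": [random.gauss(170, 10) for _ in range(size)],
    "exponential(rate=0.5)": [random.expovariate(0.5) for _ in range(size)],
    "gamma(shape=2, scale=3)": [random.gammavariate(2, 3) for _ in range(size)],
    "weibull(scale=1, shape=1.5)": [random.weibullvariate(1, 1.5) for _ in range(size)],
}

# Means implied by the parameters of each distribution.
theoretical_means = {
    "normal(mu=170, sigma=10)": 170.0,
    "exponential(rate=0.5)": 1 / 0.5,
    "gamma(shape=2, scale=3)": 2 * 3,
    "weibull(scale=1, shape=1.5)": 1 * math.gamma(1 + 1 / 1.5),
}

sample_means = {name: statistics.mean(values) for name, values in draws.items()}
for name in draws:
    print(f"{name}: sample mean {sample_means[name]:.2f}, "
          f"implied by parameters {theoretical_means[name]:.2f}")
```

Changing a parameter changes the shape of the distribution, and with it the kind of data the distribution produces.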
Probability Distributions in Models
Parameters
In mathematical models, Greek letters are used to represent the unknown parameters, such as μ and σ in the normal distribution, which need to be estimated from the data.
Probability Density Function
Probability distributions are expressed as probability density functions, which map each value of the random variable to a non-negative real number and must integrate to 1 to be a valid probability distribution.
Random Variables
The random variables in a model, denoted by x or y, are assumed to follow a corresponding probability distribution, which allows us to make probabilistic statements about the outcomes.
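The requirement that a density integrates to 1 can be checked numerically. The sketch below writes out the normal density with parameters μ and σ and integrates it with a simple trapezoidal rule. The values μ = 170 and σ = 10 are illustrative assumptions, and `integrate` is a helper written for this example, not a library function.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h


mu, sigma = 170.0, 10.0  # illustrative values, not estimates from real data

# Total area under the density; the tails beyond 8 sigma are negligible.
area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 8 * sigma, mu + 8 * sigma)

# A probabilistic statement about x: the chance it falls within one sigma of mu.
prob_within_one_sigma = integrate(lambda x: normal_pdf(x, mu, sigma), mu - sigma, mu + sigma)

print(f"area under the density: {area:.6f}")
print(f"P(mu - sigma < x < mu + sigma): {prob_within_one_sigma:.4f}")
```

The same integration routine turns the density into probabilities for any interval, which is how a model supports probabilistic statements about outcomes.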
Fitting a Model
Specify Model
The first step in fitting a model is to specify the functional form of the relationship between the variables, based on your understanding of the underlying process.
Estimate Parameters
The model parameters are then estimated using optimization methods applied to the observed data, resulting in a fitted model with specific parameter values.
Interpret Results
The fitted model can now be used to make predictions or draw conclusions about the underlying process that generated the data, which is the goal of statistical inference.
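The three steps can be sketched end to end on simulated data. Everything specific here is an assumption made for illustration: the linear form y = β0 + β1·x + noise, the "true" parameter values used to generate the data, and the choice of least squares as the optimization criterion. Least squares happens to have a closed-form minimizer for a straight line, so no iterative optimizer is needed.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Step 1, specify: assume y = beta0 + beta1 * x + noise (a hypothetical linear form).
true_beta0, true_beta1 = 2.0, 0.5  # used only to simulate the observed data
xs = [float(i) for i in range(100)]
ys = [true_beta0 + true_beta1 * x + random.gauss(0, 1) for x in xs]

# Step 2, estimate: choose the parameters that minimize the sum of squared errors.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar


# Step 3, interpret: use the fitted model to predict y at a new x.
def predict(x):
    return beta0_hat + beta1_hat * x


print(f"estimated intercept: {beta0_hat:.2f}, estimated slope: {beta1_hat:.2f}")
print(f"prediction at x = 200: {predict(200):.1f}")
```

With real data the true parameters are never seen; only the estimates are available, which is why the choice of functional form in step 1 matters so much.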
From Data to Insight
1 Data Collection
The data-generating processes in the real world are captured through subjective data collection methods, introducing uncertainty into the data.
2 Statistical Modeling
Statistical models are used to represent the relationships in the data, with unknown parameters that need to be estimated from the observed data.
3 Statistical Inference
The process of fitting the model to the data and drawing conclusions about the underlying processes is the core of statistical inference in data science.
Conclusion
The field of data science is built upon the foundation of statistical inference, which allows us to turn
the complex, uncertain, and data-generating world into insights and understanding. By carefully
modeling the relationships in the data and fitting those models to the observed information, data
scientists can uncover the patterns and processes that shape the world around us.
