
Data Science Fundamentals

• What is Data Science?

• Data Science is the area of study which involves extracting
insights from vast amounts of data using various scientific
methods, algorithms, and processes.

• It helps to discover hidden patterns from the raw data.

• Data Science is an interdisciplinary field that allows us to
extract knowledge from structured or unstructured data.

https://www.guru99.com/data-science-tutorial.html
History of data science

• In a paper published in 1962, American statistician John W.
Tukey wrote that data analysis “is intrinsically an empirical
science.”

• Four years later, Peter Naur, a Danish software programming
pioneer, proposed datalogy -- "the science of data and data
processes" -- as an alternative to computer science.

• He later used the term data science in his 1974 book, Concise
Survey of Computer Methods, describing it as "the science of
dealing with data" -- though again in the context of computer
science, not analytics.

https://www.techtarget.com/searchenterpriseai/definition/data-science
• In 1996, the International Federation of Classification Societies
included data science in the name of the conference it held that
year.

• In a presentation at the event, Japanese statistician Chikio
Hayashi said data science includes three phases: "design for
data, collection of data and analysis on data."

• A year later, C. F. Jeff Wu, a university professor in the U.S. who
was born in Taiwan, proposed that statistics be renamed data
science and that statisticians be called data scientists.

https://www.techtarget.com/searchenterpriseai/definition/data-science
• American computer scientist William S. Cleveland outlined
data science as a full analytics discipline in an article titled
"Data Science: An Action Plan for Expanding the Technical
Areas of Statistics," which was published in 2001 in the
International Statistical Review.

• Two research journals focused on data science were launched
in the next two years.

• The first use of data scientist as a professional job title is
credited to DJ Patil and Jeff Hammerbacher, who jointly
decided to adopt it in 2008 while working at LinkedIn and
Facebook, respectively.

https://www.techtarget.com/searchenterpriseai/definition/data-science
• Why Data Science?

• Data is the oil of today’s world. With the right tools, technologies,
and algorithms, we can convert data into a distinct business
advantage

• Data Science can help you to detect fraud using advanced machine learning
algorithms

• It helps you to prevent any significant monetary losses

• It allows you to build intelligent capabilities into machines

• It enables you to make better and faster decisions

• It helps you to recommend the right product to the right customer to
enhance your business
https://www.guru99.com/data-science-tutorial.html
Components

https://www.guru99.com/data-science-tutorial.html
• Statistics:
• Statistics is the most critical unit of Data Science basics, and it
is the method or science of collecting and analyzing numerical
data in large quantities to get useful insights.

• Visualization:
• Visualization techniques help you present huge amounts of data
as easy-to-understand, digestible visuals.

https://www.guru99.com/data-science-tutorial.html
• Machine Learning:
• Machine learning explores the building and study of
algorithms that learn to make predictions about
unforeseen/future data.

• Deep Learning:
• Deep learning is a newer area of machine learning research
in which multi-layer neural networks learn from the data which
features and model structure to use.
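The machine learning bullet above describes algorithms that learn to make predictions about unseen data. As a minimal sketch of that idea, the following fits a straight line to a few training points by ordinary least squares and then predicts a value it has not seen. The data points and the function names `fit_line` and `predict` are made up for illustration and do not come from the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: the "past" observations.
train_x = [1, 2, 3, 4]
train_y = [2, 4, 6, 8]

model = fit_line(train_x, train_y)
prediction = predict(model, 5)  # x = 5 was never seen during fitting
print(model, prediction)
```

Real projects would normally use a library rather than hand-written formulas; the point here is only the split between learning from known data and predicting on new data.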

https://www.guru99.com/data-science-tutorial.html
Key Pillars of Data Science

https://www.geeksforgeeks.org/data-science-fundamentals/
• Domain Knowledge:
– Many people think that domain knowledge is not
important in data science, but it is essential. The foremost
objective of data science is to extract useful insights from
data so that they can be profitable to the company’s
business.
– One needs to know how to ask the right questions of
the right people in order to obtain the information we need.
Visualization tools used on the business end, such as
Tableau, help us present results and insights in a
non-technical format, such as graphs or pie charts, that
business people can understand.

https://www.geeksforgeeks.org/data-science-fundamentals/
• Math Skills:

• Linear Algebra, Multivariable Calculus & Optimization
Technique: These three things are very important as they help
us in understanding various machine learning algorithms that
play an important role in Data Science.

• Statistics & Probability: Understanding of Statistics is very
significant as this is a part of Data analysis. Probability is also
significant to statistics and it is considered a prerequisite for
mastering machine learning.

https://www.geeksforgeeks.org/data-science-fundamentals/
• Computer Science:

– Programming Knowledge: One needs to have a good grasp
of programming concepts such as data structures and
algorithms. The programming languages used are Python, R,
Java, and Scala. C++ is also useful in some places where
performance is very important.
– Relational Databases: One needs to know SQL and a
relational database such as Oracle in order to retrieve the
necessary data whenever required.
– Non-Relational Databases: There are many types of non-relational
databases; the most widely used are Cassandra,
HBase, MongoDB, CouchDB, Redis, and Dynamo.

https://www.geeksforgeeks.org/data-science-fundamentals/
• Machine Learning:
– It is one of the most vital parts of data science and one of
the most active research areas, so new advancements are
made every year. One needs to understand at least the
basic algorithms of Supervised and Unsupervised Learning.
– There are multiple libraries available in Python and R for
implementing these algorithms.

https://www.geeksforgeeks.org/data-science-fundamentals/
• Distributed Computing:
• It is also one of the most important skills to handle a large
amount of data because one can’t process this much data on a
single system.

• The most widely used tools are Apache Hadoop and Spark.
Hadoop has two major parts. The first is HDFS (Hadoop
Distributed File System), which stores data across a
distributed file system.

• The second part is MapReduce, with which we process the data.
One can write MapReduce programs in Java or Python.
There are various other tools such as Pig, Hive, etc.
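To make the map-reduce idea above concrete, here is a minimal single-machine sketch of the pattern in plain Python, using the classic word-count task. It does not use Hadoop or any distributed runtime; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: in a real cluster these lines would live in HDFS blocks.
sample_lines = ["big data big insight", "data beats opinion"]
counts = word_count(sample_lines)
print(counts)
```

In a real framework the map and reduce functions run in parallel on many machines and the shuffle moves data between them over the network; the three-step structure is the same.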

https://www.geeksforgeeks.org/data-science-fundamentals/
• Communication Skill:

– It includes both written and verbal communication. In a
data science project, after conclusions are drawn from the
analysis, the results have to be communicated to others.
– Sometimes this may be a report you send to your boss or
team at work. Other times it may be a blog post. Often it
may be a presentation to a group of colleagues.
– Regardless, a data science project always involves some
form of communication of the project’s findings. So
communication skills are necessary for becoming a data
scientist.

https://www.geeksforgeeks.org/data-science-fundamentals/
Skills of a Data scientist

Drew Conway’s Venn diagram of data science places
data science at the intersection of three areas:
substantive expertise, hacking skills, and math &
statistics knowledge.
https://www.geeksforgeeks.org/data-science-fundamentals/
Data Science Process

https://www.techtarget.com/searchenterpriseai/definition/data-science
Life cycle
Data Science Job Roles

• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager

https://www.guru99.com/data-science-tutorial.html
• Data Scientist:
• Role: A Data Scientist is a professional who manages
enormous amounts of data to come up with compelling
business visions by using various tools, techniques,
methodologies, algorithms, etc.
• Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

• Data Engineer:
• Role: A data engineer works with large amounts of data.
They develop, construct, test, and maintain architectures
such as large-scale processing systems and databases.
• Languages: SQL, Hive, R, SAS, Matlab, Python, Java, Ruby,
C++, and Perl
https://www.guru99.com/data-science-tutorial.html
• Data Analyst:
• Role: A data analyst is responsible for mining vast
amounts of data. They look for relationships, patterns, and
trends in the data, and then deliver compelling reports and
visualizations that help the business make the most
viable decisions.
• Languages: R, Python, HTML, JS, C, C++, SQL

• Statistician:
• Role: The statistician collects, analyses, and understands
qualitative and quantitative data using statistical theories
and methods.
• Languages: SQL, R, Matlab, Tableau, Python, Perl, Spark,
and Hive
https://www.guru99.com/data-science-tutorial.html
• Data Administrator:
• Role: A data administrator ensures that the database is
accessible to all relevant users, that it is performing
correctly, and that it is kept safe from hacking.
• Languages: Ruby on Rails, SQL, Java, C#, and Python

• Business Analyst:
• Role: This professional needs to improve business
processes. He/she is an intermediary between the
business executive team and the IT department.
• Languages: SQL, Tableau, Power BI, and Python

https://www.guru99.com/data-science-tutorial.html
Roles and Responsibilities of a Data Scientist

• Knowledge about unstructured data management


• Hands-on experience in SQL database coding
• Able to understand multiple analytical functions
• Processing, cleansing, and verifying the integrity of
data used for analysis
• Obtain data and recognize its strengths
• Work with professional DevOps consultants to help
customers operationalize models

https://www.guru99.com/data-science-tutorial.html
Tools for Data Science

https://www.guru99.com/data-science-tutorial.html
• R: It is open-source software. It is easy to learn
R as it is well documented. It offers strong
statistical capabilities.
• Python: It is another popular open-source
scripting language. It supports libraries such
as NumPy, SciPy, and Matplotlib. You can
perform most statistical operations and build
a wide range of models using these libraries.
• SAS: It is a widely used analytical tool in the
commercial analytics market. It offers a plethora
of statistical functions and a good GUI.
Applications of Data Science

• Internet Search:
• Google search uses Data science technology to search for a
specific result within a fraction of a second
• Recommendation Systems:
• Recommendation systems, such as “suggested friends” on
Facebook or “suggested videos” on YouTube, are built
with the help of Data Science.
• Image & Speech Recognition:
• Speech recognition systems like Siri, Google Assistant, and Alexa
run on Data Science techniques.
• Moreover, Facebook recognizes your friend when you upload a
photo with them, with the help of Data Science.
https://www.guru99.com/data-science-tutorial.html
• Gaming world:
• EA Sports, Sony, Nintendo are using Data science technology.
This enhances your gaming experience. Games are now
developed using Machine Learning techniques, and they can
update themselves when you move to higher levels.

• Online Price Comparison:


• PriceRunner, Junglee, Shopzilla work on the Data science
mechanism. Here, data is fetched from the relevant websites
using APIs.

https://www.guru99.com/data-science-tutorial.html
• Banking: loan/credit card approval
– Predict good customers based on old customers

• Customer relationship management


– Identify those who are likely to leave for a competitor

• Targeted marketing
– Identify likely responders to promotions

• Fraud detection:
– Identify fraudulent events in an online stream of events

• Manufacturing and production


– Automatically adjust knobs when process parameter
changes
• Medicine: disease outcome, effectiveness of treatments
– Analyze patient disease history: find relationships between
diseases

• Scientific data analysis


– Gene analysis

• Web site/store design and promotion


– Find affinity of visitor to pages and modify layout
Challenges of Data Science Technology

• A high variety of information & data is required for accurate
analysis
• An adequate data science talent pool is not available
• Management does not provide financial support for a data
science team
• Unavailability of/difficult access to data
• Business decision-makers do not effectively use data science
results
• Explaining data science to others is difficult
• Privacy issues
• Lack of domain experts
• If an organization is very small, it can’t have a Data Science team

https://www.guru99.com/data-science-tutorial.html
Data Analytics Challenge

DATA

https://www.slideshare.net/hemapani/data-science-in-the-real-world-making-a-difference
INFORMATION

Total Information Awareness

• Following the terrorist attack of Sept. 11, 2001, it was noticed
that there were four people enrolled in different flight
schools, learning how to pilot commercial aircraft, although
they were not affiliated with any airline.

• Total Information Awareness (TIA) was run by the Defense
Advanced Research Projects Agency (DARPA), a branch of the
Department of Defense that works on military research.
Meaningfulness of Answers
• A big risk when data mining is that you will “discover” patterns
that are meaningless.
• Statisticians call it Bonferroni’s principle: (roughly) if you look in
more places for interesting patterns than your amount of data
will support, you are bound to find crap.

• Bonferroni’s principle helps us avoid treating random
occurrences as if they were real.
• Bonferroni’s principle says that we may only detect terrorists by
looking for events that are so rare that they are unlikely to
occur in random data.
Examples
• A big objection to TIA was that it was looking for so many
vague connections that it was sure to find things that were
bogus and thus violate innocents’ privacy.

• The Rhine Paradox: a great example of how not to conduct
scientific research.
Rhine Paradox
• Joseph Rhine was a parapsychologist in the 1950s who
hypothesized that some people had Extra-Sensory Perception.

• He devised an experiment where subjects were asked to
guess the color (red or blue) of 10 hidden cards.

• He discovered that almost 1 in 1000 had ESP: they were
able to get all 10 right!
• He told these people they had ESP and called them in for
another test of the same type.

• Alas, he discovered that almost all of them had lost their ESP.

• What did he conclude?
• He concluded that you shouldn’t tell people
they have ESP; it causes them to lose it.
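The "almost 1 in 1000" figure is exactly what pure guessing predicts, which is the point of Bonferroni's principle here. The short sketch below works through the arithmetic, assuming each red/blue guess is an independent fair coin flip; the function name `prob_all_correct` and the group size of 1000 subjects are illustrative assumptions, not details from the slides.

```python
from fractions import Fraction


def prob_all_correct(num_cards, num_colors=2):
    """Probability of guessing every card right by chance alone."""
    return Fraction(1, num_colors) ** num_cards


# Chance that one subject guesses all 10 red/blue cards correctly.
p_first_test = prob_all_correct(10)

# Expected number of "ESP" subjects among a hypothetical 1000 pure guessers.
expected_lucky = float(p_first_test * 1000)

# A lucky subject has the same small chance of repeating the feat on a retest,
# so nearly all of them appear to "lose" their ESP the second time.
p_pass_retest = prob_all_correct(10)

print(p_first_test, expected_lucky, float(p_pass_retest))
```

With a chance of 1/1024 per subject, roughly one lucky guesser per thousand is expected even if nobody has ESP, and that person will almost certainly fail the second test.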
A Concrete Example

• This example illustrates a problem with intelligence-gathering.

• Suppose we believe that certain groups of evil-doers are
meeting occasionally in hotels to plot doing evil.

• We want to find people who at least twice have stayed at the
same hotel on the same day.
The Details
• $10^{9}$ people being tracked.

• 1000 days.

• Each person stays in a hotel 1% of the time (10 days out of
1000).

• Hotels hold 100 people (so $10^{5}$ hotels).

• If everyone behaves randomly (i.e., no evil-doers), will the
data mining detect anything suspicious?
Calculations

• Probability that persons p and q will be at the same hotel on
day d: $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$.

• Probability that p and q will be at the same hotel on two given
days: $10^{-9} \times 10^{-9} = 10^{-18}$.

• Pairs of days: $5 \times 10^{5}$.

• Probability that p and q will be at the same hotel on some two
days: $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.

• Pairs of people: $5 \times 10^{17}$.

• Expected number of suspicious pairs of people:
$5 \times 10^{17} \times 5 \times 10^{-13} = 250,000$.
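The calculation above can be reproduced in a few lines of Python. This is a sketch under the slide's assumptions (everyone behaves randomly and independently); the constant and function names are chosen here for illustration. It uses exact pair counts from `math.comb` rather than the rounded values, so the result comes out slightly below 250,000.

```python
from math import comb

NUM_PEOPLE = 10**9   # people being tracked
NUM_DAYS = 1000      # days of observation
P_IN_HOTEL = 0.01    # chance a given person is in some hotel on a given day
NUM_HOTELS = 10**5   # hotels, each holding 100 people


def expected_suspicious_pairs(people, days, p_in_hotel, hotels):
    """Expected number of pairs that share a hotel on two days by chance."""
    # Both p and q are in a hotel that day, and it is the same hotel.
    p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels
    # The same coincidence happens on two specific days.
    p_same_hotel_two_given_days = p_same_hotel_one_day ** 2
    # Any two of the observed days will do.
    p_some_two_days = comb(days, 2) * p_same_hotel_two_given_days
    # Multiply by the number of distinct pairs of people.
    return comb(people, 2) * p_some_two_days


expected = expected_suspicious_pairs(NUM_PEOPLE, NUM_DAYS, P_IN_HOTEL, NUM_HOTELS)
print(f"Expected suspicious pairs: {expected:,.0f}")
```

Changing the constants shows how quickly the number of chance coincidences grows with the number of people tracked, since the pair count grows with the square of the population.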
Conclusion
• Suppose there are (say) 10 pairs of evil-doers who definitely
stayed at the same hotel twice.

• Analysts have to sift through 250,010 candidates to find the
10 real cases.
