
Unit I Introduction to Data Science and Big Data

1.1 Basics and Need of Data Science and Big Data


Is data science only the stuff going on in companies like Google and Facebook
and other tech companies? Why do many people refer to big data as crossing
disciplines (astronomy, finance, tech, etc.) and to data science as only taking
place in tech? Just how big is big? Or is it just a relative term? These terms are
so ambiguous, they're well-nigh meaningless.
There's a distinct lack of respect for the researchers in academia and industry
labs who have been working on this kind of thing for years, and whose work is
based on decades (in some cases, centuries) of work by statisticians, computer
scientists, mathematicians, engineers, and scientists of all types. From the way
the media describes it, you would think machine learning algorithms were just
invented last week and data was never "big" until Google came along.
Statisticians already feel that they are studying and working on the "science
of data." That's their bread and butter. Maybe you, dear reader, are not a
statistician and don't care, but imagine that for the statistician, this feels a
little bit like how identity theft might feel for you.
Although we will make the case that data science is not just a rebranding of
statistics or machine learning but rather a field unto itself, the media often
describes data science in a way that makes it sound as if it's simply statistics
or machine learning in the context of the tech industry.
People have said to us, "Anything that has to call itself a science isn't."
Although there might be truth in that, it doesn't mean that the term "data
science" represents nothing; of course, what it represents may not be science
so much as a craft.
Fig. 1: Architecture of big data and data science
There are many debates as to whether data science is a new field. Many
argue that similar practices have been used and branded as statistics, analytics,
business intelligence, and so forth. In either case, data science is a very popular
and prominent term used to describe many different data-related processes and
techniques that will be discussed here.
Big data on the other hand is relatively new in the sense that the amount of
data collected and the associated challenges continue to require new and
innovative hardware and techniques for handling it.
This article is meant to give the non-data scientist a solid overview of the
many concepts and terms behind data science and big data. While related terms
will be mentioned at a very high level, the reader is encouraged to explore the
references and other resources for additional detail. A follow-up post will
explore related technologies, algorithms, and methodologies in much greater
detail.
Need of data science and big data
Need of Data Science:
The main goal of data science is to discover patterns in data. It analyses
and draws conclusions from the data using a variety of statistical approaches.
A data scientist must evaluate the data extensively, from data extraction
through wrangling and pre-processing.
Then it's up to him to make forecasts based on the data. A data scientist's
mission is to draw conclusions from data, and his findings help businesses
make better business decisions.
Data is essential to propel the movement forward in everything from
business to the health industry, science to our daily lives, marketing to research
and so on. Computer Science and Information Technology have taken over
our lives, and they are progressing at such a rapid and diverse rate that the
operational procedures utilized just a few years ago are now useless.
Challenges and issues are in the same boat. In terms of complexity, the
challenges and worries of the past for a certain theme, ailment, or deficiency
may not be the same now.
To stay up with the difficulties of today and tomorrow, as well as to find
answers to unresolved issues, every field of science and study, as well as every
company, requires an updated set of operational systems and technologies.
Need of Big Data:
The value of big data isn't solely determined by the amount of data available;
its worth is determined by how you use it. By evaluating data from any source,
you can get answers that 1) streamline resource management, 2) increase
operational efficiencies, 3) optimise product development, 4) drive new revenue
and growth prospects, and 5) enable smart decision making. When big data and
high-performance analytics are combined, you can accomplish business-related
tasks such as:
• Determining the root causes of failures, difficulties, and flaws in
near-real time.
• Detecting anomalies faster and more accurately than the naked eye can.
• Improving patient outcomes by transforming medical image data into
insights as quickly as possible.
• Recalculating whole risk portfolios in minutes.
• Increasing the ability of deep learning models to accurately classify and
respond to changing variables.
• Detecting fraudulent activity before it has a negative impact on your
company.
1.2 Applications of Data Science:
Data science applications didn't take on these roles overnight. Thanks to
faster computing and cheaper storage, we can now forecast outcomes in minutes
that used to take many human-hours to process.
A data scientist earns a remarkable $124,000 per year thanks to a scarcity of
qualified workers in this industry, and demand for data science certifications
is at an all-time high because of it!
Below are ten applications that build on data science concepts across a
variety of domains:
Fraud and Risk Detection:
Finance was one of the first industries to use data science. Every year,
businesses were fed up with bad loans and losses. They did, however, have a lot
of data that had been acquired during the initial paperwork for loan approval,
so they decided to hire data scientists to help them recover from their losses.
Over time, banking businesses have learned to divide and conquer data using
customer profiling, historical spending, and other critical indicators to
assess risk and default probabilities. Furthermore, this has helped them
promote their banking products based on the purchasing power of their
customers.
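To make the idea concrete, here is a minimal sketch of risk scoring with
logistic regression; the feature names and data are invented for illustration,
not taken from any real bank's model.

```python
# Hedged sketch: score a hypothetical loan application with logistic
# regression trained on synthetic data (all names and numbers invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic applicants: [annual_income, spending_score, past_defaults]
X = rng.normal(loc=[6.0, 0.5, 0.2], scale=[2.0, 0.2, 0.4], size=(500, 3))
# Synthetic label: default more likely with low income and past defaults
y = (X[:, 2] - 0.3 * X[:, 0] + rng.normal(0, 1, 500) > -1.0).astype(int)

model = LogisticRegression().fit(X, y)

applicant = [[4.5, 0.4, 1.0]]  # one new, hypothetical application
print("Estimated default probability:", model.predict_proba(applicant)[0][1])
```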
Healthcare:
Data Science applications are very beneficial to the healthcare industry:
1. Medical Image Analysis:
Procedures such as detecting malignancies, artery stenosis, and organ
delineation use a variety of approaches and frameworks, such as MapReduce,
to identify appropriate parameters for tasks like lung texture
classification. For solid texture classification, they apply machine
learning techniques such as support vector machines (SVMs), content-based
medical image indexing, and wavelet analysis (a minimal SVM sketch appears
after this list).
2. Drug Development:
The drug discovery process is quite complex and involves a wide range of
professions. The best ideas are frequently constrained by billions of
dollars in testing, as well as significant money and time commitment; a
formal submission takes an average of twelve years.
From the first screening of therapeutic compounds through the prediction
of the success rate based on biological parameters, data science
applications and machine learning algorithms simplify and shorten this
process, bringing a new viewpoint to each step. Instead of "lab
experiments," these algorithms can predict how the substance will act in
the body using advanced mathematical modelling and simulations. The goal of
computational drug discovery is to develop computer model simulations in
the form of a physiologically relevant network, which makes it easier to
anticipate future outcomes with high accuracy.
3. Genetics & Genomics:
Through genetics and genomics research, data science applications also
provide a higher level of treatment customisation. The goal is to discover
specific biological linkages between genetics, illnesses, and treatment
response in order to better understand the impact of DNA on our health.
Data science tools enable the integration of various types of data with
genomic data in illness research, allowing for a better understanding of
genetic factors in drug and disease responses. As soon as we have solid
personal genome data, we will have a better grasp of human DNA, and
advanced genetic risk prediction will be a significant step toward more
personalised care.
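Referring back to the medical image analysis point above, the following is a
minimal sketch of solid texture classification with an SVM. It assumes image
patches have already been reduced to numeric feature vectors (for example,
wavelet coefficients) and substitutes synthetic numbers for real scans.

```python
# Hedged sketch: SVM texture classification on synthetic feature vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# 200 synthetic "patches", 16 texture features each, two tissue classes
X = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
               rng.normal(1.5, 1.0, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)   # the SVM classifier
print("Held-out accuracy:", clf.score(X_test, y_test))
```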
Internet Search:
When you think about data science applications, this is usually the first thing
that comes to mind.
When we think of search, we immediately think of Google. Right? However, there
are other search engines, such as Yahoo, Bing, Ask, and others. All of these
search engines (including Google) use data science techniques to deliver the
best results for our searched query in a matter of seconds, bearing in mind
that Google alone processes over 20 petabytes of data per day.
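One core idea behind such engines can be shown in a few lines: an inverted
index that maps each term to the documents containing it. This toy version
ignores the ranking signals (TF-IDF, PageRank, and so on) that real engines
layer on top.

```python
# Toy inverted index: term -> set of document IDs containing that term.
from collections import defaultdict

docs = {
    1: "data science in the tech industry",
    2: "big data and machine learning",
    3: "statistics is the science of data",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every query term."""
    hits = [index[term] for term in query.split()]
    return set.intersection(*hits) if hits else set()

print(search("data science"))  # {1, 3}
```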
Targeted Advertising:
If you thought search was the most important data science use, consider this:
the full digital marketing spectrum. Data science algorithms are used to
determine practically everything, from display banners on various websites to
digital billboards at airports.
This is why digital advertisements have a far higher CTR (click-through rate)
than traditional advertisements: they can be tailored to a user's previous
actions. It is also why you may see adverts for data science training
programmes while someone else sees an advertisement for apparel in the same
spot at the same time.
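For reference, the CTR mentioned above is a simple ratio, sketched below with
made-up campaign numbers.

```python
# Click-through rate: the fraction of ad impressions that led to a click.
def ctr(clicks, impressions):
    return clicks / impressions

# Hypothetical numbers, purely for illustration
print(f"Untargeted banner: {ctr(400, 100_000):.2%}")    # 0.40%
print(f"Targeted banner:   {ctr(2_500, 100_000):.2%}")  # 2.50%
```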
Website Recommendations:
Aren’t we all used to Amazon’s suggestions for similar products? They not only
assist you in locating suitable products from the billions of products accessible,
but they also enhance the user experience.
Many businesses have aggressively employed this engine to market their products
based on user interest and information relevance. Internet companies such as
Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many others use this
technique to improve the user experience. The recommendations are based on a
user's previous search results (a minimal sketch follows).
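Below is a minimal sketch of one classic approach, item-to-item similarity over
a tiny user-item rating matrix; the ratings are invented, and production
engines such as Amazon's combine many richer signals.

```python
# Hedged sketch: item-to-item recommendation via cosine similarity.
import numpy as np

# Rows = users, columns = items; 0 means "not rated" (invented data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare item 0's rating column against every other item's column
sims = [cosine(ratings[:, 0], ratings[:, j]) for j in range(1, 4)]
print("Similarity of item 0 to items 1-3:", np.round(sims, 2))
# Items with the highest similarity would be recommended alongside item 0
```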
Advanced Image Recognition:
You share a photograph on Facebook with your friends, and you start receiving
suggestions to tag them. Face recognition methods are used in this automatic
tag suggestion feature.
Facebook's recent posts detail the extra progress they've achieved in this
area, highlighting their improvements in image recognition accuracy and
capacity.
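Facebook's production models are proprietary deep networks, but the detection
step can be illustrated with OpenCV's bundled Haar cascade; this sketch assumes
the opencv-python package is installed and that a file named photo.jpg exists.

```python
# Hedged sketch: detect faces in a photo with OpenCV's Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("photo.jpg")            # hypothetical input photo
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

print(f"Found {len(faces)} face(s)")
for (x, y, w, h) in faces:                 # one box per detected face
    print(f"  at ({x}, {y}), size {w}x{h}")
```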
Speech Recognition:
Google Voice, Siri, Cortana, and other speech recognition products are some of
the best examples. Even if you are unable to type a message, your life will
not come to a halt if you use the speech recognition option: simply say the
message out loud, and it will be converted to text. However, you will notice
that speech recognition is not always accurate.
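As a small illustration, the sketch below transcribes a recorded audio file
using the third-party SpeechRecognition package; the file message.wav and the
use of Google's free web API are assumptions for the example, not how Siri or
Cortana work internally.

```python
# Hedged sketch: speech-to-text with the SpeechRecognition package
# (pip install SpeechRecognition); "message.wav" is a hypothetical file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("message.wav") as source:
    audio = recognizer.record(source)      # read the whole file

try:
    print("You said:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    # As noted above, recognition is not always accurate
    print("Could not understand the audio.")
```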
Airline Route Planning:
The airline industry has been known to suffer significant losses all over the
world. With the exception of a few aviation service providers, companies are
struggling to maintain their occupancy ratios and operating profits. The
problem has worsened due to the huge rise in air-fuel prices and the need to
offer significant discounts to customers. It wasn't long before airlines began
to use data science to pinpoint important areas for improvement. Thanks to
data science, airlines can now do the following:
• Calculate the likelihood of a flight delay (a minimal sketch appears after
this list).
• Decide which type of plane to buy.
• Decide whether to fly to the destination directly or make a stop along the
way (for example, a flight from New Delhi to New York can take a direct
route, or it can opt to stop in another country en route).
• Drive consumer loyalty programmes effectively.
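As promised above, here is a minimal sketch of estimating flight-delay
likelihood with a decision tree; the features and data are synthetic
placeholders, not a real airline's model.

```python
# Hedged sketch: flight-delay likelihood from synthetic features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

# Features: [departure_hour, route_distance_km, bad_weather (0/1)]
X = np.column_stack([
    rng.integers(0, 24, 1000),
    rng.uniform(200, 12000, 1000),
    rng.integers(0, 2, 1000),
])
# Synthetic rule: delays cluster in bad-weather evening departures
y = ((X[:, 2] == 1) & (X[:, 0] > 16)).astype(int)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
flight = [[18, 6000, 1]]  # 18:00 departure, 6000 km, bad weather
print("Estimated P(delay):", clf.predict_proba(flight)[0][1])
```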
Gaming:
Machine learning algorithms are increasingly used to create games that
evolve and upgrade as the player progresses through the levels. In motion
gaming, your opponent (the computer) also studies your previous moves and
adjusts its game accordingly. EA Sports, Zynga, Sony, Nintendo, and
Activision Blizzard have all used data science to take gaming to the next
level.
Augmented Reality:
This is the last of the data science applications listed here, and it
appears to have the most potential in the future. Augmented reality refers
to the use of technology to overlay computer-generated information, such as
images and sound, onto a user's view of the real world.
1.3 Data Explosion:
• Parallel to the expansion in the service offerings of IT companies, there
is growth in another environment – the data environment. The volume of
data is practically exploding by the day. Not only this, the data that is
available now is becoming increasingly unstructured. Statistics from IDC
projected that global data would grow by up to 44 times between 2009 and
2020, amounting to a massive 35.2 zettabytes (ZB; a zettabyte is a billion
terabytes).
• These factors, coupled with the need for real-time data, constitute the
"big data" environment. How can organizations stay afloat in the big data
environment? How can they manage this copious amount of data?
• I believe a three-tier approach to managing big data would be the key:
the first tier to handle structured data, the second involving appliances
for real-time processing, and the third for analysing unstructured
content. Can this structure be tailored for your organization?
• No matter what the approach might be, organizations need to create a
cost-effective method that provides a structure to big data. According to
a report by McKinsey & Company, accurate interpretation of big data can
improve retail operating margins by as much as 60%. This is where
information management comes in.
• Information management is vital to be able to summarise the data into
a manageable and understandable form. It is also needed to extract
useful and relevant data from the large pool that is available and to
standardize the data. With information management, data can be
standardized in a fixed form, and standardized data can be used to find
underlying patterns and trends.
• Statistics say that the United States alone could face a shortage of
140,000 to 190,000 people with the requisite analytic and decision-making
skills by 2018. Organizations are now looking for partners for effective
information management to form mutually beneficial, far-sighted
arrangements.
• The challenge before the armed forces is to develop tools that enable
extraction of relevant information from the data for mission planning
and intelligence gathering. And for that, armed forces require data
scientists like never before.
• Big data describes a massive volume of both structured and unstructured
data – data so large that it is difficult to process using traditional
database and software techniques. While the term refers to the volume of
data, it also covers the technology, tools, processes, and storage
facilities required to handle such large amounts of data.
1.4 V's of Big Data:
In recent years, the “3Vs” of Big Data have been replaced by the “5Vs”, which
are also known as the characteristics of Big Data and are as follows:
1) Volume
• Volume refers to the amount of data generated through websites,
portals and online applications. Especially for B2C companies,
Volume encompasses the available data that are out there and need
to be assessed for relevance.
• Volume defines the data infrastructure capability of an
organization’s storage, management and delivery of data to end users
and applications. Volume focuses on planning current and future
storage capacity – particularly as it relates to velocity -but also in
reaping the optimal benefits of effectively utilising a current storage
infrastructure.
• Volume is the V most associated with big data because, well, volume
can be big. What we’re talking about here is quantities of data that
reach almost incomprehensible proportions.
• Facebook, for example, stores photographs. That statement doesn't begin
to boggle the mind until you start to realize that Facebook has more
users than China has people, and each of those users has stored a whole
lot of photographs. Facebook is storing roughly 250 billion images.
• Try to wrap your head around 250 billion images. Or try this one: as far
back as 2016, Facebook had 2.5 trillion posts. Seriously, that's a number
so big it's pretty much impossible to picture.
• So, in the world of big data, when we start talking about volume,
we're talking about insanely large amounts of data. As we
move forward, we're going to have more and more huge
collections. For example, as we add connected sensors to pretty
much everything, all that telemetry data will add up.
• How much will it add up? Consider this: Gartner, Cisco, and Intel
estimate there will be between 20 and 200 billion connected IoT devices
(no, they don't agree, surprise!). The numbers are huge no matter what,
but it's not just the quantity of devices.
• Consider how much data is coming off of each one. I have a temperature
sensor in my garage. Even with a one-minute level of granularity (one
measurement a minute), that's still 525,600 data points in a year, and
that's just one sensor. Let's say you have a factory with a thousand
sensors; you're looking at over half a billion data points per year, just
for the temperature alone (see the quick calculation below).
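The arithmetic in the last bullet is easy to verify:

```python
# Back-of-the-envelope check of the sensor arithmetic above
minutes_per_year = 60 * 24 * 365        # 525,600 readings per sensor
readings = minutes_per_year * 1_000     # a factory with 1,000 sensors
print(f"{readings:,} data points per year")  # 525,600,000 - about half a billion
```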
2) Velocity:
• With velocity we refer to the speed with which data are being generated.
Staying with our social media example: every day, 900 million photos are
uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4
million hours of video are uploaded to YouTube, and around 3.5 billion
searches are performed on Google.
• This is like a nuclear data explosion. Big data technology helps a
company absorb this explosion, accept the incoming flow of data, and at
the same time process it quickly enough that it does not create
bottlenecks.
• 250 billion images may seem like a lot. But if you want your mind
blown, consider this: Facebook users upload more than 900 million
photos a day. A day. So that 250 billion number from last year will
seem like a drop in the bucket in a few months.
• Velocity is the measure of how fast the data is coming in. Facebook has
to handle a tsunami of photographs every day. It has to ingest them all,
process them, file them, and somehow, later, be able to retrieve them.
• Here's another example. Let's say you're running a marketing campaign and
you want to know how the folks "out there" are feeling about your brand
right now. How would you do it? One way would be to license some Twitter
data from Gnip (acquired by Twitter) to grab a constant stream of tweets
and subject them to sentiment analysis (a minimal sketch appears after
this list).
• That feed of Twitter data is often called "the firehose" because so much
data is being produced that it feels like being at the business end of a
fire hose.
• Here’s another velocity example: packet analysis for cyber security.
The Internet sends a vast amount of information across the world
every second. For an enterprise IT team, a portion of that flood has
to travel through firewalls into a corporate network.
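As flagged in the marketing-campaign bullet above, here is a minimal
lexicon-based sentiment sketch over a stream of made-up tweets; real pipelines
use trained models and a licensed feed rather than a hard-coded word list.

```python
# Hedged sketch: toy lexicon-based sentiment scoring of a tweet stream.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad"}

def sentiment(text):
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

stream = [                      # invented tweets standing in for a feed
    "I love this brand, great service",
    "terrible experience, awful support",
    "the new ad is good",
]

for tweet in stream:
    score = sentiment(tweet)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(f"{label:8} ({score:+d}): {tweet}")
```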
3) Variety:
• It refers to the structured, semi-structured, and unstructured data
types.
• It can also refer to a variety of sources.
• Variety refers to the influx of data from new sources both inside and
outside of an organisation. It might be organised, semi-organised, or
unorganised.
• Structured data – is data organised according to a fixed schema, such as
the rows and columns of a relational database. Customer tables and
transaction records are typical examples.
• Semi-structured data – is a type of data that is semi-organised. It
doesn't follow the traditional tabular data structure but still carries
tags or markers of its own. Log files are a typical example of this type
of data.
• Unstructured data – is just data that has not been arranged. It
usually refers to data that doesn’t fit cleanly into a relational
database’s standard row and column structure. Texts, pictures,
videos, etc. are examples of unstructured data which can’t be stored
in the form of rows and columns.
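The three categories can be made concrete with one record of each kind:

```python
# Structured vs semi-structured vs unstructured, in miniature
import json

# Structured: fixed schema, fits relational rows and columns
structured_row = ("C1042", "2024-01-15", 199.99)

# Semi-structured: self-describing tags, flexible shape (e.g. JSON logs)
semi_structured = json.loads('{"user": "C1042", "tags": ["sale", "web"]}')

# Unstructured: free text with no predefined organisation
unstructured = "Customer called to say the delivery arrived late but intact."

print(structured_row[2], semi_structured["tags"], unstructured.split()[0])
```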
4) Veracity:
• It refers to data inconsistencies and uncertainty, i.e., available data
can become untidy at times, and quality and accuracy are difficult to
control.
• Because of the numerous data dimensions originating from multiple
distinct data types and sources, big data can also be volatile.
• For example, a large amount of data can cause confusion, yet a smaller
amount may convey only partial or incomplete information.
5) Value:
• After considering the four V's, there is one more V to consider: value!
Most data, lacking value, is useless to the organization until it is
converted into something beneficial.
• Data is of no utility or relevance in and of itself; it must be turned
into something useful in order to extract information. As a result, value
can be considered the most essential of the five V's.
1.5 Relationship between Data Science and Information Science:
The finding of knowledge or actionable information in data is what data
science is all about. The design of procedures for storing and retrieving
information is known as information science.

Data Science Vs Information Science:


Data science and information science are two separate but related fields.
Data science draws primarily on computer science and mathematics, while
library science, cognitive science, and communications are all areas of
interest in information science.
Business tasks such as strategy formulation, decision making, and
operational processes all rely on data science. It covers artificial
intelligence, analytics, predictive analytics, and algorithm design, among
other topics.
Knowledge management, data management, and interaction design are all
domains where information science is employed.
Factor | Data Science | Information Science
Definition | The finding of knowledge or actionable information in data is what data science is all about. | The design of procedures for storing and retrieving information is known as information science.

1.6 Business Intelligence versus Data Science:


Data Science
Data Science is a field in which data is mined for information and knowledge
using a variety of scientific methods, algorithms, and processes. It can thus be
characterized as a collection of mathematical tools, algorithms, statistics, and
machine learning techniques that are used to uncover hidden patterns and insights
in data to aid decision making. Both organised and unstructured data are dealt
with in data science. It has to do with data mining as well as big data. Data Science
is researching historical trends and then applying the findings to reshape current
trends and forecast future trends.
Business Intelligence
Business Intelligence (BI) is a combination of technology, tools, and processes
that businesses utilize to analyse business data. It is mostly used to transform raw
data into useful information that can then be used to make business decisions and
take profitable actions. It is concerned with the analysis of organised and
unstructured data in order to open up new and profitable business opportunities.
It favours fact-based decision making over assumption-based decision making.
As a result, it has a direct impact on a company’s business decisions. Business
intelligence tools improve a company's prospects of entering a new market and
aid in the analysis of marketing activities.
The following table compares and contrasts Data Science and Business
Intelligence
Factor | Data Science | Business Intelligence
Concept | A discipline that employs mathematics, statistics, and other methods to uncover hidden patterns in data. | A collection of technology, applications, and processes that businesses employ to analyse business data.
Focus | Centred on the future. | Concentrates on both the past and the present.
Data | Handles both structured and unstructured data. | Primarily works with structured data.
Flexibility | More adaptable, since data sources can be added as needed. | Less flexible, because data sources for business intelligence must be planned ahead of time.
Method | Employs the scientific method. | Employs the analytic method.
Complexity | More sophisticated in comparison to business intelligence. | A lot easier when compared to data science.
Expertise | Its area of competence is the data scientist. | Its area of specialisation is the business user.
Questions | Addresses the questions of what will happen and what might happen. | Concerned with the question of what occurred.
Tools | SAS, BigML, MATLAB, Excel, and other programmes. | InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, and more.
