
Introduction to Data Science

Dr. M. Dhurgadevi
Associate Professor
Sri Krishna College of Technology
Coimbatore
Outline
Data, Big Data, and Challenges
Data Science
◦ Introduction
◦ Why Data Science
Data Scientists
◦ What do they do?
Major/Concentration in Data Science
◦ What courses to take.
Data All Around
Lots of data is being collected
and warehoused
◦ Web data, e-commerce
◦ Financial transactions, bank/credit transactions
◦ Online trading and purchasing
◦ Social Network
How Much Data Do We Have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day
(5/2009)
1000 genomes project: 200 TB
• Cost of 1 TB of disk: $35
• Time to read 1 TB disk: about 3 hrs (at 100 MB/s)
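The read time quoted above can be checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a purely sequential read, and the function name `read_time_hours` is made up for illustration.

```python
# Rough check of the "time to read 1 TB" figure.
# Assumption: decimal units and an uninterrupted sequential read.
TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def read_time_hours(size_bytes, throughput_bytes_per_s):
    """Return the hours needed to read size_bytes at a constant throughput."""
    seconds = size_bytes / throughput_bytes_per_s
    return seconds / 3600


hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # 10,000 seconds, i.e. a little under 3 hours
```

Under the same assumptions, the 200 TB of the 1000 genomes project would take 200 times as long on a single disk, which is one reason large data sets are spread across many machines.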
Introduction to Big Data
• What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals
and recorded on magnetic, optical, or mechanical recording media.
• What is Big Data?
Big Data is data of very large size. The term describes a collection of data
that is huge in volume and still growing exponentially with time. In short,
such data is so large and complex that traditional data management tools
cannot store or process it efficiently.

• “Extremely large data sets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and
interactions, are known as Big Data.”

MODULE 1 Introduction to Big Data

• Examples of Big Data
The following are some examples of Big Data:
• The New York Stock Exchange generates about one terabyte of new
trade data per day.



• Social Media
Statistics show that 500+ terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly
generated through photo and video uploads, message exchanges,
comments, etc.
TWITTER

• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches
many petabytes.
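To see how quickly this adds up, the sketch below converts the quoted rate into the engine-hours needed to reach one petabyte. It assumes 1 PB = 1000 TB and takes 10 TB per 30 minutes as the rate; since the source says "10+", the result is an upper bound on the time. The function name `engine_hours_to_reach` is made up for illustration.

```python
# Engine-hours needed to accumulate a target volume at the quoted rate.
# Assumption: 1 PB = 1000 TB, and exactly 10 TB per 30 minutes per engine.
TB_PER_PB = 1000


def engine_hours_to_reach(target_tb, tb_per_half_hour=10):
    """Return the engine-hours needed to generate target_tb terabytes."""
    tb_per_hour = tb_per_half_hour * 2
    return target_tb / tb_per_hour


hours_for_one_pb = engine_hours_to_reach(1 * TB_PER_PB)
print(hours_for_one_pb)  # 50.0 engine-hours per petabyte
```

At this rate, a few dozen engine-hours already amount to a petabyte, so thousands of flights per day reach many petabytes.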



Big Data
Big Data is any data that is expensive to manage and
hard to extract value from
◦ Volume
▪ The size of the data
◦ Velocity
▪ The latency of data processing relative to the growing
demand for interactivity
◦ Variety and Complexity
▪ The diversity of sources, formats, quality, and structures
Big Data
Types of Data We Have
Relational Data
(Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can afford to scan the data once
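The "scan the data once" constraint on streaming data means a statistic has to be updated as each value arrives, without storing the stream. As one possible illustration (not something the slides prescribe), the sketch below uses Welford's online algorithm to keep a running mean and variance in a single pass. The class name `RunningStats` and the sample values are made up for this example.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume an iterable exactly once and return its running statistics."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


# A generator can only be iterated once, which mimics a data stream.
readings = (v for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
stats = summarize(readings)
print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant however long the stream is, because only the count, the mean, and one accumulator are kept.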
What To Do With These Data?
Aggregation and Statistics
◦ Data warehousing and OLAP
Indexing, Searching, and Querying
◦ Keyword based search
◦ Pattern matching (XML/RDF)
Knowledge discovery
◦ Data Mining
◦ Statistical Modeling
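The first two items in the list above can be sketched in a few lines of plain Python. This is a toy illustration only, not a description of any particular warehouse or search engine: `roll_up` sums a measure along one dimension the way an OLAP roll-up does, and `build_inverted_index` with `search` implements a basic keyword search. All names and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records standing in for rows of a fact table.
sales = [
    {"region": "north", "product": "laptop", "amount": 1200},
    {"region": "north", "product": "phone", "amount": 800},
    {"region": "south", "product": "laptop", "amount": 1500},
]


def roll_up(rows, key, measure):
    """Aggregate (sum) a measure along one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents for the keyword search.
docs = {
    1: "big data needs new tools",
    2: "data science uses statistics",
    3: "big ideas in science",
}
index = build_inverted_index(docs)

print(roll_up(sales, "region", "amount"))
print(search(index, "big data"))
```

Real systems add much more (pre-computed cubes, ranking, stemming), but the core ideas are a group-by for aggregation and a word-to-documents map for search.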
What is Data Science?
An area that manages, manipulates,
extracts, and interprets knowledge from
tremendous amounts of data
Data science (DS) is a multidisciplinary
field of study whose goal is to address the
challenges in big data
Data science principles apply to all data –
big and small
What is Data Science?
Theories and techniques from many fields and
disciplines are used to investigate and analyze a
large amount of data to help decision makers in
many industries such as science, engineering,
economics, politics, finance, and education
◦ Computer Science
▪ Pattern recognition, visualization, data warehousing, high-
performance computing, databases, AI
◦ Mathematics
▪ Mathematical modeling
◦ Statistics
▪ Statistical and stochastic modeling, probability
Data Science
Data Science Process
Data Science Applications
• Healthcare: Data science can identify and predict disease, and
personalize healthcare recommendations.
• Transportation: Data science can optimize shipping routes in real time.
• Sports: Data science can accurately evaluate athletes’ performance.
• Government: Data science can prevent tax evasion and predict
incarceration rates.
• E-commerce: Data science can automate digital ad placement.
• Gaming: Data science can improve online gaming experiences.
• Social media: Data science can create algorithms to pinpoint
compatible partners.
• Fintech: Data science can help create credit reports and financial
profiles, run accelerated underwriting and create predictive models
based on historical payroll data.
Real Life Examples
1. IDENTIFYING CANCER TUMORS
• LYNA is a tool for identifying breast cancer tumors that metastasize to nearby lymph
nodes. LYNA (short for Lymph Node Assistant) accurately identified metastatic
cancer 99 percent of the time using its machine-learning algorithm.
2. TRACKING MENSTRUAL CYCLES
• Behind the scenes, data scientists mine the wealth of anonymized data that
cycle-tracking apps collect, using tools like Python and Jupyter Notebook.
3. PERSONALIZING TREATMENT PLANS
• Oncora’s software uses machine learning to create personalized recommendations for
current cancer patients based on data from past ones. Their radiology team
collaborated with Oncora data scientists to mine 15 years’ worth of data on
diagnoses, treatment plans, outcomes and side effects from more than 50,000 cancer
records. Based on this data, Oncora’s algorithm learned to suggest personalized
chemotherapy and radiation regimens.
4. CLEANING CLINICAL TRIAL DATA
• Veeva is a cloud software company that provides data and software solutions for the
healthcare industry. The company’s reach extends through clinical, regulatory and
commercial medical fields. Veeva’s Vault EDC uses data science to clean clinical
trial findings and help medical professionals make adjustments mid-study.
Real Life Examples
5. MODELING TRAFFIC PATTERNS
• StreetLight uses data science to model traffic patterns for cars, bikes and pedestrians
on North American streets. Based on a monthly influx of trillions of data points from
smartphones, in-vehicle navigation devices and more, StreetLight’s traffic maps stay
up to date. The company’s maps inform various city planning enterprises, including
commuter transit design.

6. OPTIMIZING FOOD DELIVERY
• The data scientists at UberEats have a fairly simple goal:
getting hot food delivered quickly. Making that happen across the country, though,
takes machine learning, advanced statistical modeling and staff meteorologists.

7. IMPROVING PACKAGE DELIVERY
• UPS uses data science to optimize package transport from drop-off to delivery. The
company’s integrated navigation system ORION helps drivers choose over 66,000
fuel-efficient routes. ORION has saved UPS approximately 100 million miles and 10
million gallons of fuel per year with the use of advanced algorithms, AI and machine
learning.
Data Scientists
Data Scientist
◦ The excellent Job of the 21st Century
They find stories, extract knowledge.
They are not reporters
Data Scientists
Data scientists are the key to realizing the
opportunities presented by big data. They
bring structure to it, find compelling
patterns in it, and advise executives on the
implications for products, processes, and
decisions.
What do Data Scientists do?
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more ….
Concentration in Data Science
Mathematics and Applied Mathematics
Applied Statistics/Data Analysis
Solid Programming Skills (R, Python, Julia, SQL)
Data Mining
Database Storage and Management
Machine Learning and Discovery
