
Research Methods in Data Science

MISSION: CHRIST is a nurturing ground for an individual’s holistic development to make effective contribution to the society in a dynamic environment
VISION: Excellence and Service
CORE VALUES: Faith in God | Moral Uprightness | Love of Fellow Beings | Social Responsibility | Pursuit of Excellence
CHRIST
Deemed to be University
Unit - II
Introduction to Data Science
Definition – Big Data and Data Science Hype – Why data science – Getting Past the
Hype – The Current Landscape – Who is a Data Scientist? – Data Science Process
Overview – Defining goals – Retrieving data – Data preparation – Data exploration –
Data modeling – Presentation
Sampling, Measurement and Scaling Techniques
Sampling: Steps in Sampling Design, Different Types of Sample Designs,
Measurement and Scaling: Measurement in Research, Measurement Scales,
Technique of Developing Measurement Tools, Scaling, Important Scaling Techniques.

• Davy Cielen and Arno Meysman, Introducing Data Science. Simon and Schuster, 2016
• C. R. Kothari, Research Methodology: Methods and Techniques. 3rd ed. New Delhi: New Age
International Publishers, Reprint 2014

Unit – II
Part-I

Introduction to Data Science


Big Data

• Big data is a blanket term for any collection of data sets so large or complex
that it becomes difficult to process them using traditional data management
techniques such as relational database management systems (RDBMS).
• Big data is important because it can be used to gain insights into a wide variety
of areas, including business, healthcare, and government. It can also be used to
improve decision making, predict trends, and identify new opportunities.
• Data science involves using methods to analyze massive amounts of data and
extract the knowledge it contains.
• Data science and big data evolved from statistics and traditional data
management but are now considered to be distinct disciplines.

Big data poses a number of challenges, including:
• Storage: Big data requires a lot of storage space. Traditional data storage methods, such as relational
databases, are not scalable enough to handle the massive amounts of data generated by big data
applications. Cloud-based storage solutions are a more scalable option, but they can be expensive.
• Processing: Big data is difficult to process using traditional data processing methods. Traditional data
processing methods are designed to process small batches of data in a sequential manner. Big data,
on the other hand, is often streaming in real time and needs to be processed in parallel. This requires
specialized hardware and software.
• Analysis: Big data is difficult to analyze and extract insights from. Traditional data analysis methods
are not designed to handle the volume, velocity, and variety of big data. This requires new data
analysis methods that can scale to handle big data and that can deal with unstructured data.
• Security: Big data is a valuable asset that is vulnerable to security threats. Big data applications often
collect sensitive data, such as personal information and financial data. This data needs to be
protected from unauthorized access, use, disclosure, disruption, modification, or destruction.
• Skills: There is a shortage of skilled big data professionals. Big data requires a variety of skills,
including data engineering, data science, and machine learning. There are not enough people with
these skills to meet the demand for big data talent.
• Organizational resistance: Some organizations are resistant to change and are reluctant to adopt big
data technologies. This can be a major barrier to the adoption of big data.

Current Landscape of Big Data – Characteristics / Framework

[Figure: the 3 V's, 4 V's, and 7 V's of big data]

The "Seven V's of Big Data" is a framework used to describe the key characteristics that
define the challenges and opportunities associated with big data. These characteristics help
illustrate why traditional data processing methods and tools are often insufficient for
handling and making sense of large and complex datasets.


• Volume: This refers to the sheer scale of data generated and collected. Big data involves massive
amounts of information that exceed the processing capacity of conventional databases and tools. The
volume of data is measured in terms of petabytes, exabytes, and beyond.
o Scenario: Social Media Data Analysis for Marketing. Description: A marketing company is analyzing
social media data to understand consumer sentiment towards a new product launch. They collect and
process millions of tweets, comments, and posts in real-time to gauge public reactions and identify
potential areas for improvement.

• Velocity: Velocity pertains to the speed at which data is generated, processed, and delivered. With the
advent of real-time data streams from sources like social media, sensors, and financial markets,
organizations need to analyze and act upon data in near-real-time to capitalize on opportunities and
respond to challenges.
o Scenario: Stock Market Real-time Analysis. Description: An investment firm is monitoring stock market
data in real time to make informed trading decisions. They process and analyze market data streams,
such as price fluctuations and trading volumes, to identify trends and execute buy/sell orders swiftly.

• Variety: Variety refers to the diverse types of data that big data encompasses. This includes structured
data (such as relational databases), unstructured data (such as text and images), and semi-structured
data (such as XML files). Managing and extracting insights from this varied data requires specialized
techniques and tools.
o Scenario: Healthcare Data Integration. Description: A hospital is integrating various types of patient
data, including structured electronic health records (EHRs), unstructured doctor's notes, and medical
images. They use advanced analytics to correlate different data types to provide personalized treatment
plans.

• Veracity: Veracity refers to the accuracy and reliability of data. In the big data context, data can come
from numerous sources, each with varying levels of accuracy and trustworthiness. Ensuring data quality
and addressing issues like inconsistencies and errors become critical to making reliable decisions.
o Scenario: Fraud Detection in Financial Transactions. Description: A credit card company is analyzing a
massive volume of transaction data to detect fraudulent activities. They use machine learning algorithms
to identify patterns that indicate potentially fraudulent transactions while reducing false positives.

• Value: The value of big data lies in its potential to provide meaningful insights and drive informed decisions.
Extracting value from big data involves analyzing and interpreting the data to uncover patterns, trends,
correlations, and insights that can lead to improved business strategies, innovations, and efficiencies.
o Scenario: Retail Customer Behavior Analysis. Description: An e-commerce company is analyzing customer
behavior data to improve sales and marketing strategies. They analyze browsing history, purchase patterns,
and demographics to personalize recommendations, promotions, and advertisements, ultimately driving
higher conversion rates.

• Variability: Variability refers to the inconsistency of data flows, which can be erratic and unpredictable. Data
can arrive in irregular intervals, and its structure can change over time. Handling variability requires flexible
data processing techniques and tools that can adapt to changing data patterns.
o Scenario: Weather Forecasting and Emergency Response. Description: A meteorological agency is collecting
and processing weather data from various sources, including satellites, sensors, and weather stations. They
handle the variability in data frequency and format to provide accurate and timely weather forecasts for
disaster preparedness.


• Visibility (Visualization): Visibility refers to the ability to access and understand data from
various perspectives. Effective visualization and data presentation techniques are crucial
to making complex data comprehensible and actionable for a wide range of stakeholders.
o Scenario: Supply Chain Analytics for Manufacturing. Description: A manufacturing
company is using big data analytics to gain visibility into its supply chain. They track data
from suppliers, production facilities, transportation, and distribution centers to optimize
inventory levels, reduce lead times, and enhance overall operational efficiency.

These Seven V's collectively emphasize the challenges and opportunities presented by big data.
Organizations that can successfully address these characteristics can harness the power of big
data to gain insights, improve decision-making, drive innovation, and enhance their competitive
advantage.

Big Data Landscape - Technologies
• The big data landscape consists of many different technologies that can be
categorized into the following:
– File system
– Distributed programming frameworks
– Data integration
– Databases
– Machine learning
– Security
– Scheduling
– Benchmarking
– System deployment
– Service programming
Facets of Data
Structured
• Definition: Structured data is organized with a specific format, schema, and
clear relationships between data elements. It's typically stored in databases
using rows and columns, making it easily queryable and analyzable.
• Examples: Customer information, sales transactions, product inventory,
financial records.

Unstructured
• Definition: Unstructured data lacks a fixed format or schema. It
comes in various forms, making it more flexible but harder to
analyze compared to structured data. Specialized tools are often
required for meaningful insights.
• Examples: Textual content (documents, emails), images, social
media posts, videos.

Natural language
• Definition: Natural language data refers to text or speech data
generated by humans in their everyday communication.
Analyzing natural language involves techniques like sentiment
analysis, language translation, and text summarization.
• Examples: Social media comments, customer reviews, email
correspondence.

Machine-generated
• Definition: Machine-generated data is produced by
automated systems, devices, or sensors without human
intervention. It's often generated at high volumes and high
velocities.
• Examples: Sensor data from IoT devices, server logs, GPS data.

Graph-based data
• Definition: Graph-based data represents relationships
between entities using nodes (vertices) and edges. It's
especially useful for modeling complex interactions and
networks.
• Examples: Social networks (nodes as users, edges as
connections), supply chain networks, knowledge graphs.
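
As a minimal sketch of the node-and-edge idea, a small social network can be held in an adjacency structure. The names and friendships below are hypothetical and only illustrate the representation; real graph workloads would more likely use a dedicated graph library or database.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected graph as a mapping: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)  # edge a -> b
        graph[b].add(a)  # undirected, so also b -> a
    return graph

# Hypothetical users (nodes) and connections (edges).
friendships = [("asha", "ben"), ("ben", "chitra"), ("asha", "dev")]
network = build_graph(friendships)

# Users connected to both asha and chitra.
mutual = network["asha"] & network["chitra"]
print(sorted(network["asha"]), sorted(mutual))
```

Storing neighbours as sets makes relationship questions, such as shared connections, a simple set intersection.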

Audio, video, and images
• Definition: These data types include multimedia content like audio recordings, video clips,
and images. Analyzing such data often involves computer vision and audio processing
techniques.
• Examples: Surveillance camera footage, medical imaging, YouTube videos.

Streaming Data
• Definition: Streaming data refers to real-time data that is generated, processed, and
analyzed as it is produced. It's crucial for applications that require immediate insights and
actions.
• Examples: Stock market tick data, social media live feeds, IoT sensor data.

Data Science
• Data Science is an interdisciplinary field that involves the use of various
techniques, algorithms, processes, and systems to extract insights and
knowledge from structured and unstructured data.
• It combines expertise from domains such as statistics, computer science,
domain knowledge, and data engineering to analyze complex data sets and
solve real-world problems.
• Data science is widely applicable across industries and sectors, from
business and healthcare to finance and social sciences. It aids in solving a
range of challenges, such as predicting customer preferences, detecting
fraud, optimizing supply chains, analyzing medical data, and more. The field
is constantly evolving, driven by advancements in technology and the
growing availability of data.
Data Scientist

• Professionals in data science, often referred to as data scientists, possess a
blend of technical skills (programming, statistics, machine learning) and
domain-specific knowledge.
• They leverage data to generate insights that inform decision-making, enhance
processes, and create innovative solutions. The ability to extract meaningful
information from data is central to data science's role in addressing complex
and data-rich challenges in the modern world.
• Data scientists should possess a unique combination of technical prowess,
critical thinking, and interpersonal skills, allowing them to extract valuable
insights from data and contribute significantly to an organization's success.

Data Scientist – Key Qualities
• Analytical Mindset
• Strong Programming Skills (Python/R)
• Statistical Knowledge
• Machine Learning Expertise
• Domain Knowledge
• Data Wrangling
• Data Visualization
• Problem-Solving
• Continuous Learning
• Communication Skills
• Ethical Considerations
• Curiosity
• Attention to Detail
• Business Acumen
• Team Player
Data Science - Process
• Problem Definition
• Data Collection
• Data Cleaning and Preprocessing
• Exploratory Data Analysis (EDA)
• Feature Engineering
• Model Selection and Training
• Model Evaluation
• Model Deployment
• Interpretation and Visualization
• Iterative Improvement
• Communication and Presentation
The data science process is not always strictly linear, and iterations may occur between different stages as new insights
are discovered or challenges are encountered. Successful data science projects require collaboration between domain
experts, data engineers, and data scientists to ensure that the entire process yields valuable and actionable results.

Unit – II
Part-II

Sampling, Measurement and Scaling Techniques



Sampling Techniques
• Deliberate sampling: Deliberate sampling is also known as purposive or non-probability
sampling. This method involves the purposive or deliberate selection of particular units of
the universe to constitute a sample that represents the universe. When population elements
are selected for inclusion in the sample based on ease of access, it is called convenience
sampling. In judgement sampling, on the other hand, the researcher’s judgement is used to
select items that the researcher considers representative of the population.
Scenario: Market Research for a New Product Launch

A technology company is preparing to launch a new smartphone model targeted at a specific
demographic: tech-savvy professionals aged 25 to 40 who are frequent travelers. The company
wants to conduct market research to gather insights into this target audience's preferences,
expectations, and purchasing behaviors.

In this scenario, deliberate sampling allows the company to focus its market research efforts on a
specific demographic that is crucial for the success of its new product. While deliberate sampling
doesn't provide the statistical representativeness of probability sampling, it offers valuable
qualitative insights that can guide decision-making in product development and marketing
strategies.
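
The deliberate selection in this scenario can be sketched as a simple filter over a pool of candidates. The `Respondent` fields, the sample data, and the threshold of six trips per year for a "frequent traveler" are all assumptions made for illustration; in practice the researcher's own judgement defines the criteria.

```python
from dataclasses import dataclass

@dataclass
class Respondent:
    name: str
    age: int
    trips_per_year: int
    tech_savvy: bool

def purposive_sample(candidates, min_trips=6):
    """Keep only candidates the researcher judges to fit the target profile."""
    return [
        c for c in candidates
        if 25 <= c.age <= 40 and c.tech_savvy and c.trips_per_year >= min_trips
    ]

# Hypothetical candidate pool.
pool = [
    Respondent("A", 29, 10, True),
    Respondent("B", 45, 12, True),   # outside the age range
    Respondent("C", 33, 2, True),    # not a frequent traveler
    Respondent("D", 38, 8, True),
    Respondent("E", 27, 9, False),   # not tech-savvy
]
panel = purposive_sample(pool)
print([r.name for r in panel])
```

No random draw is involved, which is why results from such a sample cannot be generalized with the usual probability-based error estimates.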

• Simple random sampling: This type of sampling is also known as chance sampling or
probability sampling where each and every item in the population has an equal chance of
inclusion in the sample and each one of the possible samples, in case of finite universe,
has the same probability of being selected.

Scenario: Political Opinion Poll in a City

Imagine a scenario where a research firm wants to conduct a political opinion poll to gauge the
sentiments of the residents of a city regarding upcoming local elections. The firm aims to use
simple random sampling to ensure that each eligible resident has an equal chance of being
included in the survey.

Simple random sampling in this scenario ensures that every eligible resident on the sampling
frame has an equal chance of being included in the survey, minimizing potential selection bias
and providing a representative sample of the population's political opinions.
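
A minimal sketch of the draw, assuming the sampling frame is simply a list of resident IDs. The frame size, the sample size, and the seed are hypothetical; the seed is there only to make the illustration repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical sampling frame: IDs of 10,000 eligible residents.
resident_ids = list(range(1, 10_001))
poll_sample = simple_random_sample(resident_ids, 500, seed=42)
print(len(poll_sample), len(set(poll_sample)))
```

Because `random.sample` draws without replacement, no resident can be selected twice.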

• Systematic sampling: In some instances the most practical way of sampling is to select
every 15th name on a list, every 10th house on one side of a street, and so on. Sampling of
this type is known as systematic sampling. An element of randomness is usually introduced
into this kind of sampling by using random numbers to pick the unit with which to start.
This procedure is useful when the sampling frame is available in the form of a list.

Scenario: Customer Feedback Collection in a Retail Store

Consider a scenario where a retail store wants to collect customer feedback to understand their
satisfaction levels and improve their services. The store aims to use systematic sampling to
efficiently gather feedback from customers while maintaining randomness in the selection process.

Systematic sampling in this scenario allows the retail store to collect customer feedback in an
organized and efficient manner while maintaining a level of randomness. It ensures that feedback is
gathered from a variety of customers, providing insights that can lead to improvements in the
store's operations and customer satisfaction.
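
The "random start, then every k-th unit" rule can be sketched as follows. The customer list and the interval of 15 are assumptions chosen to echo the "every 15th name" example; they are not taken from the scenario.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random start index in 0..k-1
    return frame[start::k]

# Hypothetical sampling frame: 300 customers in order of arrival.
customers = [f"customer_{i:03d}" for i in range(1, 301)]
feedback_sample = systematic_sample(customers, k=15, seed=7)
print(len(feedback_sample))
```

Only the starting point is random; once it is fixed, the rest of the sample is fully determined, so a periodic pattern in the list can bias the result.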

• Stratified sampling: If the population from which a sample is to be drawn does not constitute a
homogeneous group, then the stratified sampling technique is applied so as to obtain a
representative sample. In this technique, the population is stratified into a number of
non-overlapping subpopulations or strata, and sample items are selected from each stratum. If the
items from each stratum are selected by simple random sampling, the entire procedure, first
stratification and then simple random sampling, is known as stratified random sampling.

Scenario: Educational Assessment of Schools in a District

Imagine a scenario where a district school authority wants to assess the academic performance of
its students across different grade levels and subjects. The district authority aims to use stratified
sampling to ensure representation from each grade level and subject area while maintaining a
manageable sample size.

Stratified sampling in this scenario allows the district authority to obtain a representative sample of
students from each grade level and subject area. This ensures that the assessment results
accurately reflect the academic performance of students across the entire school district, enabling
effective educational planning and targeted interventions.
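
A minimal sketch of stratified random sampling, assuming proportional allocation (the same sampling fraction in every stratum). The grade sizes and the 10% fraction are hypothetical, and other allocation rules are possible.

```python
import random
from collections import Counter, defaultdict

def stratified_random_sample(units, stratum_of, fraction, seed=None):
    """Split units into strata, then draw a simple random sample within each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        n = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical district: (grade, student_number) pairs with unequal grade sizes.
sizes = {6: 300, 7: 200, 8: 100}
students = [(grade, i) for grade, size in sizes.items() for i in range(size)]
chosen = stratified_random_sample(students, lambda s: s[0], fraction=0.10, seed=3)
per_grade = Counter(grade for grade, _ in chosen)
print(dict(per_grade))
```

Every grade is guaranteed to appear in the sample, which a single simple random draw over the whole district would not ensure.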


• Multi-stage sampling: This is a further development of the idea of cluster sampling. This
technique is meant for big inquiries extending to a considerably large geographical area like
an entire country. Under multi-stage sampling the first stage may be to select large primary
sampling units such as states, then districts, then towns and finally certain families within
towns.
Scenario: Environmental Impact Assessment in a Region

Imagine a scenario where an environmental agency is conducting an assessment of the
ecological impact of a construction project in a large region. The agency wants to use multi-stage
sampling to efficiently gather data from different areas while considering the diverse ecosystems
present. For example, it might randomly choose 10 urban zones, 8 rural zones, 5 forested zones,
and 3 wetland zones.

Multi-stage sampling in this scenario allows the environmental agency to efficiently gather data
from various ecosystems within the region while maintaining a representative sample. By
considering different geographical zones and sub-areas, they can assess the potential impact of
the construction project on the environment more comprehensively.
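
The states, districts, towns, families progression described for multi-stage sampling can be sketched with a nested sampling frame. The frame below and the number of units drawn at each stage are hypothetical; a real design would also decide selection probabilities at each stage.

```python
import random

def multi_stage_sample(frame, counts, rng):
    """Sample counts[0] units at this stage, then recurse into each chosen unit."""
    if isinstance(frame, dict):
        result = []
        for key in rng.sample(sorted(frame), counts[0]):
            for path in multi_stage_sample(frame[key], counts[1:], rng):
                result.append((key,) + path)
        return result
    # Final stage: frame is a plain list of ultimate sampling units.
    return [(unit,) for unit in rng.sample(frame, counts[0])]

# Hypothetical frame: 4 states x 5 districts x 6 towns x 20 families.
frame = {
    f"state_{s}": {
        f"district_{s}_{d}": {
            f"town_{s}_{d}_{t}": [f"family_{s}_{d}_{t}_{k}" for k in range(20)]
            for t in range(6)
        }
        for d in range(5)
    }
    for s in range(4)
}
# Draw 2 states, 2 districts per state, 3 towns per district, 5 families per town.
selected = multi_stage_sample(frame, [2, 2, 3, 5], random.Random(1))
print(len(selected), selected[0])
```

Only the units chosen at one stage need a list of their sub-units, which is what makes the design practical for very large areas.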


• Sequential sampling: This is a somewhat complex sample design in which the ultimate size
of the sample is not fixed in advance but is determined according to mathematical decision
rules on the basis of information yielded as the survey progresses. This design is usually
adopted under an acceptance sampling plan in the context of statistical quality control.

Scenario: Quality Control in a Manufacturing Plant

Consider a scenario where a manufacturing plant produces electronic components and wants to
ensure the quality of its products. The plant implements sequential sampling to monitor the
production process and make real-time decisions about the quality of the components.

In this scenario, sequential sampling is utilized to make ongoing decisions about the quality of
manufactured components. It enables the manufacturing plant to quickly identify and rectify
quality issues, leading to higher product quality and reduced waste.
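
One way to picture a sample size that is "not fixed in advance" is a sequential probability ratio test, in which inspection stops as soon as the evidence is strong enough either way. This specific rule is an assumption used for illustration, not a method named in the slides, and the defect rates and error levels below are hypothetical.

```python
import math

def sequential_inspection(items, p0=0.01, p1=0.05, alpha=0.05, beta=0.10):
    """Inspect items one by one (True = defective) until a decision is reached.

    p0 is the acceptable defect rate, p1 the unacceptable one; alpha and beta
    are the tolerated risks of wrongly rejecting and wrongly accepting a lot.
    Returns (decision, number_of_items_inspected).
    """
    upper = math.log((1 - beta) / alpha)  # at or above this: reject the lot
    lower = math.log(beta / (1 - alpha))  # at or below this: accept the lot
    llr = 0.0                             # running log-likelihood ratio
    n = 0
    for n, defective in enumerate(items, start=1):
        if defective:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject", n
        if llr <= lower:
            return "accept", n
    return "undecided", n

good_run = sequential_inspection([False] * 100)
bad_run = sequential_inspection([True] * 100)
print(good_run, bad_run)
```

A run of defective components triggers rejection after very few inspections, while a clean run needs more items before the lot is accepted.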

• Quota sampling: In stratified sampling, the cost of taking random samples from individual
strata is often so high that interviewers are simply given quotas to be filled from
different strata, with the actual selection of items for the sample left to the interviewer’s
judgement. This is called quota sampling.

Scenario: Consumer Preference Survey for a New Beverage

Imagine a scenario where a beverage company wants to conduct a consumer preference survey
to understand which flavors of a new drink are most popular among different age groups and
genders. The company decides to use quota sampling to ensure a balanced representation of
participants from various demographic categories.

In this scenario, quota sampling helps the beverage company gather insights about consumer
preferences across different demographic categories without conducting a fully random sample.
While quota sampling doesn't guarantee statistical representativeness like probability sampling, it
allows for a certain level of control over the composition of the sample to ensure diversity and
balance.
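
A minimal sketch of quota filling, where respondents are accepted in the order they are encountered until each demographic quota is full. The age groups, the quota of five per cell, and the stream of passers-by are hypothetical; arrival order stands in here for the interviewer's judgement.

```python
from itertools import cycle, islice

def quota_sample(respondents, category_of, quotas):
    """Accept respondents in encounter order until every quota is filled."""
    remaining = dict(quotas)
    selected = []
    for person in respondents:
        category = category_of(person)
        if remaining.get(category, 0) > 0:
            selected.append(person)
            remaining[category] -= 1
        if not any(remaining.values()):
            break  # all quotas filled, stop interviewing
    return selected

# Hypothetical quota cells: (age group, gender), five respondents each.
cells = [("18-29", "F"), ("18-29", "M"), ("30-49", "F"), ("30-49", "M")]
passersby = [(i, *cell) for i, cell in enumerate(islice(cycle(cells), 40))]
quotas = {cell: 5 for cell in cells}
survey = quota_sample(passersby, lambda p: (p[1], p[2]), quotas)
print(len(survey))
```

The composition of the sample is controlled, but who ends up in each cell is not random, which is the source of the representativeness caveat above.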

• Cluster sampling and area sampling: Cluster sampling involves grouping the population
and then selecting the groups or clusters, rather than individual elements, for inclusion in
the sample. Suppose a department store wishes to sample its credit card holders. It
has issued its cards to 15,000 customers, and the sample size is to be kept at, say, 450. For
cluster sampling, this list of 15,000 card holders could be formed into 100 clusters of 150 card
holders each. Three clusters might then be selected at random for the sample.
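
The credit card example works out as 15,000 / 150 = 100 clusters, and 3 clusters x 150 holders = 450 sampled units. A minimal sketch, assuming the clusters are formed from consecutive blocks of a hypothetical list of card holder IDs:

```python
import random

# Hypothetical list of 15,000 card holder IDs.
card_holders = list(range(1, 15_001))
cluster_size = 150

# Form 100 clusters of 150 card holders each.
clusters = [
    card_holders[i:i + cluster_size]
    for i in range(0, len(card_holders), cluster_size)
]

# Randomly select 3 whole clusters; every holder in them enters the sample.
rng = random.Random(0)  # seeded only to make the illustration repeatable
chosen_clusters = rng.sample(clusters, 3)
cluster_sample = [holder for cluster in chosen_clusters for holder in cluster]
print(len(clusters), len(cluster_sample))
```

The random draw happens at the cluster level only, so the sample is cheaper to collect but usually less precise than a simple random sample of the same size.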

Scenario: Healthcare Facilities Assessment in a Region

Imagine a scenario where a government health department wants to assess the quality of
healthcare facilities in a large region with many hospitals and clinics. The department decides to
use cluster sampling to evaluate a representative subset of healthcare facilities.

In this example, cluster sampling allows the health department to evaluate the quality of
healthcare facilities in the region without having to assess each facility individually. By selecting
representative clusters and conducting assessments within them, the department can make
informed decisions to improve healthcare services.

• Area sampling is quite close to cluster sampling and is often talked about when the total
geographical area of interest happens to be a big one. Under area sampling we first divide
the total area into a number of smaller non-overlapping areas, generally called geographical
clusters, then a number of these smaller areas are randomly selected, and all units in these
small areas are included in the sample. Area sampling is especially helpful where we do not
have a list of the population concerned.

Scenario: Urban Air Quality Assessment in a City


Imagine a scenario where an environmental agency wants to assess the air quality in a large and
densely populated city. The agency decides to use area sampling to gather data on air pollution
levels across different neighborhoods within the city.

In this example, area sampling allows the environmental agency to assess urban air quality
efficiently across a diverse city landscape. By selecting representative neighborhoods and
measuring air pollution within those areas, they can make informed decisions to address
environmental concerns and enhance the overall quality of life for city residents.
