
WHAT IS DATA COLLECTION?

Data collection is the process of gathering, measuring, and analyzing accurate data
from a variety of relevant sources to find answers to problems, evaluate outcomes, and
forecast trends and probabilities.

During data collection, researchers must identify the data types, the sources of data,
and the methods being used.

Data Collection Tools

● Word Association
● Sentence Completion
● Role-Playing
● In-Person Surveys
● Online/Web Surveys

DATA COLLECTION METHODS

These are the primary data collection methods:


1. Surveys

Surveys are physical or digital questionnaires that gather both qualitative and
quantitative data from subjects.

2. Transactional Tracking
Each time your customers make a purchase, tracking that data can allow you to
make decisions about targeted marketing efforts and understand your customer
base better.

3. Interviews and Focus Groups

Interviews and focus groups consist of talking to subjects face-to-face about a
specific topic or issue. Interviews tend to be one-on-one, and focus groups are
typically made up of several people. You can use both to gather qualitative and
quantitative data.

4. Observation

Observing people interacting with your website or product can be useful for data
collection; if your user experience is confusing, you can witness users struggling
in real time.

5. Online Tracking

To gather behavioral data, you can implement pixels and cookies. These are both
tools that track users’ online behavior across websites and provide insight.

6. Forms

Online forms are beneficial for gathering qualitative data about users, specifically
demographic data or contact information.

7. Social Media Monitoring

Monitoring your company’s social media channels for follower engagement is an
accessible way to track data about your audience’s interests and motivations.
Secondary data collection

Secondary data is second-hand data collected by other parties that has already
undergone statistical analysis. It is either information the researcher has
tasked other people to collect or information the researcher has looked up.

Challenges in Data Collection

Data Quality Issues

Inconsistent Data

Data Downtime

Ambiguous Data

Inaccurate Data

Data Preprocessing
Data preprocessing is an important step in data science. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable
for the specific task.

Some common steps in data preprocessing include:

● Data cleaning: this step involves identifying and removing missing,
inconsistent, or irrelevant data. This can also include removing
duplicate records.
(a). Missing Data:

This situation arises when some values are missing from the dataset. It can be
handled in various ways.

Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
Fill the missing values:
There are various ways to do this. You can choose to fill the missing values
manually, with the attribute mean, or with the most probable value.
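The filling strategies above can be sketched with pandas; the dataset and column names here are made-up examples:

```python
import pandas as pd

# Hypothetical dataset with missing values in both columns
df = pd.DataFrame({"age": [25, None, 35, None, 40],
                   "city": ["Pune", "Delhi", None, "Pune", "Pune"]})

# Numeric attribute: fill missing entries with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the most probable (most frequent) value
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

After this, the dataset contains no missing values: the missing ages become the mean of the observed ages, and the missing city becomes the most common one.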

(b). Noisy Data:

Noisy data is meaningless data that can’t be interpreted by machines. It can be
generated by faulty data collection, data entry errors, etc. It can be handled in
the following ways:

Binning Method:
The whole data is divided into segments (bins) of equal size, and each segment is
handled separately, e.g. by replacing each value with its bin’s mean or with the
nearest bin boundary.
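A minimal sketch of equal-frequency binning with smoothing by bin means, using illustrative values:

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))

# Split the sorted data into equal-size segments (equal-frequency bins)
bins = np.array_split(data, 4)  # 4 bins of 3 values each

# Smoothing by bin means: replace each value with its bin's mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
```

Here the first bin [4, 8, 9] is smoothed to 7, the second [15, 21, 21] to 19, and so on, reducing noise while keeping the overall distribution.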
Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
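Linear regression smoothing can be sketched with NumPy; the noisy observations below are made up for illustration:

```python
import numpy as np

# Noisy observations of a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a linear regression (one independent variable) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Smoothed values: project each noisy y onto the fitted line
y_smooth = slope * x + intercept
```

Replacing each `y` with `y_smooth` removes the noise around the fitted line while preserving the trend.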
Clustering:
This approach groups similar data into clusters. Values that fall outside all
clusters can then be treated as outliers.
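A simple sketch of the idea, assuming the cluster centres are already known (a real pipeline would learn them with a clustering algorithm such as k-means):

```python
import numpy as np

values = np.array([10, 11, 12, 50, 51, 52, 300])  # 300 sits outside both clusters

centroids = np.array([11.0, 51.0])  # assumed cluster centres for this sketch

# Distance from each value to its nearest cluster centre
dist = np.abs(values[:, None] - centroids[None, :]).min(axis=1)

# Values far from every cluster are flagged as outliers
outliers = values[dist > 10]
```

Only the value 300 is far from both centres, so it is the one flagged as an outlier.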
● Data transformation: this step involves converting the data into a format that
is more suitable for the task. This can include normalizing numerical data,
creating dummy variables, etc.
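Creating dummy variables from a categorical attribute can be sketched with pandas; the `color` column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encode the categorical attribute into dummy (0/1) variables
dummies = pd.get_dummies(df["color"], prefix="color")
```

Each distinct category becomes its own indicator column, which most numerical algorithms can consume directly.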

Normalization:
It is done in order to scale the data values into a specified range, such as [0, 1].
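Min-max normalization to [0, 1] can be sketched as follows (the values are illustrative):

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale values to the range [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.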

Attribute Selection:
In this step, new attributes are constructed from the given set of attributes to
help the mining process.

Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be generalized to “country”.
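The city-to-country example above can be sketched as a simple mapping; the place names are illustrative:

```python
# A concept hierarchy represented as a mapping that generalizes the
# low-level attribute "city" to the higher-level attribute "country"
city_to_country = {"Pune": "India", "Mumbai": "India", "Paris": "France"}

cities = ["Pune", "Paris", "Mumbai"]
countries = [city_to_country[c] for c in cities]
```

After generalization, analyses can be run at the coarser "country" level instead of the fine-grained "city" level.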

● Data reduction: this step is used to select a subset of the data that is
relevant to the task. This can include feature selection or feature extraction.

The various steps to data reduction are:

1. Data Cube Aggregation:

Aggregation operations are applied to the data to construct the data cube.

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded.
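One simple selection criterion is dropping attributes with no variance, since a constant column carries no information; the dataset here is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"income": [40, 55, 70, 90],
                   "constant_flag": [1, 1, 1, 1],  # zero variance: irrelevant
                   "age": [25, 32, 41, 58]})

# Keep only attributes whose variance exceeds zero
relevant = df.loc[:, df.var() > 0.0]
```

Real pipelines use richer relevance measures (correlation with the target, information gain, etc.), but the structure is the same: score each attribute, then keep the ones above a threshold.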
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
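PCA can be sketched with plain NumPy via the singular value decomposition; the synthetic data below has three dimensions but most of its variance lies along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples in 3 dimensions, dominated by one underlying direction
t = rng.normal(size=(100, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + rng.normal(scale=0.05, size=(100, 3))

# PCA via SVD: centre the data, then project onto the top principal component
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T  # 3 dimensions reduced to 1

# Fraction of total variance retained by the first component
explained = S[0] ** 2 / (S ** 2).sum()
```

The reduction is lossy, but because almost all the variance lies along the first component, very little information is discarded.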

● Data integration: this step involves combining data from multiple sources,
such as databases, spreadsheets, and text files. The goal of integration is to
create a single, consistent view of the data.
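Integrating two sources on a shared key can be sketched with pandas; the tables and the `cust_id` key are made-up examples:

```python
import pandas as pd

# Two sources describing the same customers
orders = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [50, 20, 30, 10]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Delhi", "Goa"]})

# Integrate into a single, consistent view keyed on cust_id
combined = orders.merge(profiles, on="cust_id", how="left")
```

Each order row is now enriched with the matching customer profile, giving one consistent table instead of two disconnected sources.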

● Data discretization: this step is used to convert continuous numerical data
into categorical data, which can be used for decision trees and other
categorical data mining techniques.
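Discretization can be sketched with pandas; the ages, bin edges, and labels below are illustrative choices:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])

# Discretize continuous ages into labelled categorical bins
categories = pd.cut(ages, bins=[0, 18, 60, 100],
                    labels=["child", "adult", "senior"])
```

The continuous `ages` column becomes a categorical column with three levels, ready for decision trees and other categorical techniques.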
