Batch 2 - PE31 - Data Preprocessing and Visualisation

(B.TECH) Semester-VIII AY 2022-23

DL Lab Assignment No. 00
Student Name: Pratyush Jha PRN No.: 1032191000

Date: 11-01-2022 Faculty: Prof. Anita Gunjal
Problem Statement: To revise the prerequisites of DL and explore the Wine Quality dataset.
Objectives:
1. To understand the basics and prerequisites of Deep Learning (DL).
2. To analyze the functions for exploring the Wine Quality Dataset.
Theory:
● Data Collection
Data collection refers to the process of gathering information from various sources. There are
several techniques associated with data collection, including:
o Surveys: This involves asking a set of questions to a group of people to gather
information about their opinions, experiences, or behaviour. Surveys can be conducted in
person, over the phone, or online.
o Interviews: This involves conducting one-on-one conversations with individuals to gather
information about their experiences, opinions, or behaviour. Interviews can be conducted
in person, over the phone, or online.
o Observations: This involves observing and recording the behaviour of individuals or
groups in their natural environment.
o Experiments: This involves manipulating one or more variables to observe the effect on a
particular outcome.
o Case studies: This involves gathering detailed information about a specific individual,
group, or organization to understand a particular phenomenon.
o Focus groups: This involves gathering a small group of people together to discuss a
specific topic or product.
o Self-report measures: This involves asking individuals to report on their own behaviour,
attitudes, or experiences.
o Transactional data: This involves collecting data from financial transactions, such as
purchase history.
o Web scraping: This involves extracting large amounts of data from websites quickly and
automatically.
o Social media scraping: This is similar to web scraping but targets social media
platforms such as Twitter, Facebook, and Instagram.
● Pre-processing of Data
Data pre-processing refers to the techniques used to prepare raw data for analysis. It is an
important step in the data science process as it helps ensure that the data is clean, consistent,
and ready for analysis. Some common techniques associated with data pre-processing include:
o Data cleaning: This involves identifying and correcting errors, inconsistencies, and
missing values in the data.
o Data integration: This involves combining data from multiple sources to create a unified
dataset.
o Data transformation: This involves converting data into a format that can be easily
analysed, such as normalizing or scaling the data.
o Data reduction: This involves reducing the complexity of the data by selecting a subset
of the most relevant features for analysis.
o Data discretization: This involves converting continuous variables into discrete
categories.
o Data normalization: This involves rescaling the values of numeric variables to a common
range, typically [0, 1] (min-max scaling).
o Data standardization: This involves transforming the values of numeric variables so that
they have a mean of 0 and a standard deviation of 1.
o Data augmentation: This refers to the technique of creating new data samples by
applying random transformations to existing samples.
o Data encoding: This refers to the process of converting categorical variables into
numerical values.
o Data hashing: This refers to the process of converting data into a fixed-length numerical
representation, so that it can be used as an input to machine learning algorithms.
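A few of the techniques above can be sketched in plain Python. This is only an illustration, not the notebook's code: the sample values, the bin cut points, and the helper names (min_max_normalize, standardize, discretize, label_encode) are all assumptions made for the example. It contrasts min-max normalization (rescaling to [0, 1]) with standardization (mean 0, standard deviation 1), then shows discretization and a simple label encoding.

```python
import statistics

# Illustrative alcohol values (made up, not taken from the real dataset).
alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def discretize(value, low_cut=10.0, high_cut=11.5):
    """Map a continuous value to one of three bins (cut points are arbitrary)."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

def label_encode(categories):
    """Assign each distinct category an integer code, in sorted order."""
    codes = {name: i for i, name in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories], codes

normalized = min_max_normalize(alcohol)      # data normalization
z_scores = standardize(alcohol)              # data standardization
bins = [discretize(v) for v in alcohol]      # data discretization
encoded, mapping = label_encode(bins)        # data encoding
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the hand-written versions are shown only to make the arithmetic visible.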
● Statistical Analysis
Statistical analysis is the use of statistical methods to collect, organize, analyze, interpret and
present data. It is used to make inferences about a population based on a sample, and to test
hypotheses about relationships between variables. Some common techniques associated with
statistical analysis include:
o Descriptive statistics: This involves summarizing and describing the data using measures
such as mean, median, mode, standard deviation, and frequency distributions.
o Inferential statistics: This involves making inferences about a population based on a
sample, such as estimating population parameters or testing hypotheses about
relationships between variables.
o Hypothesis testing: This involves testing a claim or assumption about a population by
comparing sample data to a hypothetical value or a statistical model.
o Correlation analysis: This involves examining the relationship between two or more
variables, such as determining the strength and direction of their association.
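As a rough illustration of descriptive statistics and correlation analysis, the sketch below summarizes one variable and computes a Pearson correlation coefficient between two. The paired values are made up for the example, and pearson_r is a hypothetical helper written here, not a function from the assignment's notebook.

```python
import math
import statistics

# Illustrative paired observations (made-up values).
alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]
quality = [5, 6, 6, 6, 7]

# Descriptive statistics for a single variable.
summary = {
    "mean": statistics.fmean(quality),
    "median": statistics.median(quality),
    "mode": statistics.mode(quality),
    "stdev": statistics.stdev(quality),  # sample standard deviation
}

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(alcohol, quality)
print(summary)
print(f"Pearson r between alcohol and quality: {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one; the sign gives the direction.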
● Visualization of Data
Data visualization is the process of creating graphical representations of data sets in order to
make the information they contain more easily understandable. There are many techniques
associated with data visualization, including:
o Bar charts, which are used to compare the sizes of different data sets.
o Line charts, which are used to show how a data set changes over time.
o Scatter plots, which are used to show the relationship between two data sets.
o Pie charts, which are used to show the proportion of different parts of a data set.
o Heat maps, which are used to show how a data set is distributed across two dimensions.
o Tree maps, which are used to show the hierarchical structure of a data set.
o Word clouds, which are used to show the most common words in a data set.
o Choropleth maps, which are used to show the distribution of a data set across
geographic regions.
o Network diagrams, which are used to show the connections between different elements
in a data set.
o 3D plots, which are used to show data in 3D space.
These are just a few examples of the many techniques that can be used for data
visualization. The best technique to use will depend on the specific data set and the
information you are trying to convey.
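The idea behind a bar chart can be shown without any plotting library: count how often each value occurs, then draw one bar per value. The sketch below prints a text-only bar chart using the standard library; the quality scores are made up, and text_bar_chart is a hypothetical helper. A notebook would more likely use a plotting library, which this example deliberately avoids so that it runs anywhere.

```python
from collections import Counter

# Illustrative quality scores (made-up values).
quality = [5, 6, 5, 7, 6, 5, 6, 6, 4, 5]

def text_bar_chart(values, symbol="#"):
    """Return one line per distinct value, with a bar as long as its count."""
    counts = Counter(values)
    return [
        f"{value:>2} | {symbol * counts[value]} ({counts[value]})"
        for value in sorted(counts)
    ]

chart = text_bar_chart(quality)
print("\n".join(chart))
```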
Operations to be performed on dataset: Steps in Preprocessing of Data

1. Download the dataset.
2. Open Google Colab (online experimentation) or VS Code (offline experimentation).
3. Read the .csv file of the dataset.
4. Display a few observations.
5. Display the data summary.
6. Perform data preprocessing (handling missing data, etc.).
7. Apply the sum(), mean(), median(), and standard deviation functions to some attributes.
8. Perform suitable visualisations.
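Steps 3 to 7 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the attached notebook: the in-memory sample stands in for the downloaded file, its values are made up, only three columns are shown, and the helper names (read_rows, fill_missing_with_median) are hypothetical. It assumes a semicolon delimiter, which the UCI wine quality files appear to use; adjust the delimiter if your copy differs.

```python
import csv
import io
import statistics

# Stand-in for the downloaded file: a tiny semicolon-separated sample with
# made-up values and one missing pH entry.
SAMPLE_CSV = """alcohol;pH;quality
9.4;3.51;5
9.8;3.20;5
10.5;;6
11.2;3.16;6
12.8;3.39;7
"""

def read_rows(text, delimiter=";"):
    """Step 3: parse CSV text into a list of dicts; empty cells become None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {key: (float(value) if value else None) for key, value in row.items()}
        for row in reader
    ]

def fill_missing_with_median(rows, column):
    """Step 6: replace missing values in a column with the column median."""
    present = [row[column] for row in rows if row[column] is not None]
    median = statistics.median(present)
    for row in rows:
        if row[column] is None:
            row[column] = median
    return rows

rows = read_rows(SAMPLE_CSV)
print(rows[:3])                                # Step 4: first few observations
missing = sum(row["pH"] is None for row in rows)
print(f"{len(rows)} rows, {missing} missing pH value(s)")  # Step 5: brief summary
rows = fill_missing_with_median(rows, "pH")

alcohol = [row["alcohol"] for row in rows]     # Step 7: basic statistics
stats = {
    "sum": sum(alcohol),
    "mean": statistics.fmean(alcohol),
    "median": statistics.median(alcohol),
    "stdev": statistics.stdev(alcohol),        # sample standard deviation
}
print(stats)
```

Median imputation is only one way to handle missing data; dropping incomplete rows or filling with the mean are common alternatives, and the right choice depends on the dataset.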
Program code:
The Python notebook has been attached with the submission.
Dataset used:
https://archive.ics.uci.edu/ml/datasets/wine+quality
Output:
FAQs:
Conclusion:
The prerequisites of DL were studied, and the implementation was performed to analyse the
Wine Quality dataset.
