Bi 20soeit11002 Antala Krishnaa


20SOECE11002 Business Intelligence

1. What is data, and what are its types?

Ans. Data is a collection of information gathered by observation, measurement, research or
analysis. It may consist of facts, numbers, names, figures or even descriptions of things. Data
can be organized in the form of graphs, charts or tables. Data scientists mine this data and
use it to analyse the world around us.

Classification of Data
Data is classified into two types:

• Qualitative: It describes the quality of something or someone. It is descriptive information.
For example, skin colour, eye colour and hair texture give us qualitative information
about a person.
• Qualitative data is non-statistical and is typically unstructured or semi-structured. It
is not necessarily measured with the hard numbers used to build graphs and charts.
Instead, it is categorized by properties, attributes, labels, and other identifiers.

Qualitative data can be generated through:

• Texts and documents
• Audio and video recordings
• Interview transcripts and focus groups
• Observations and notes

• Quantitative: It provides numerical information. For example, the height and weight of a person.
• In contrast to qualitative data, quantitative data is statistical and is typically structured,
meaning it is more rigid and defined. This data type is measured using
numbers and values, which makes it a more suitable candidate for data analysis.

Quantitative data can be generated through:

• Tests
• Experiments
• Surveys
• Market reports
• Metrics
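The distinction between the two types can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`eye_colour`, `height_cm`, and so on) and the helper `classify_field` are made up for the example, and treating "every value is a number" as the test for quantitative data is a simplifying assumption.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; field names and values are invented for illustration.
people = [
    {"eye_colour": "brown", "hair_texture": "curly", "height_cm": 172.0, "weight_kg": 68.5},
    {"eye_colour": "blue", "hair_texture": "straight", "height_cm": 160.5, "weight_kg": 54.0},
    {"eye_colour": "brown", "hair_texture": "wavy", "height_cm": 181.0, "weight_kg": 80.2},
]


def classify_field(values):
    """Label a field quantitative if every value is a number, else qualitative."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "qualitative"


field_types = {
    field: classify_field([row[field] for row in people]) for field in people[0]
}

# Quantitative fields support arithmetic summaries such as the mean;
# qualitative fields are summarized by counting categories instead.
summary = {}
for field, kind in field_types.items():
    values = [row[field] for row in people]
    if kind == "quantitative":
        summary[field] = mean(values)
    else:
        summary[field] = Counter(values).most_common(1)[0][0]

print(field_types)
print(summary)
```

Here the height and weight fields are averaged, while eye colour is reported as its most frequent category, which mirrors the point that quantitative data lends itself more directly to numerical analysis.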

2. Why do we do data preprocessing?

Data preprocessing removes unwanted data through data cleaning. After the
preprocessing stage, the dataset contains more valuable information for the
data manipulation that follows later in the data mining process.

Data in the real world is dirty:

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
e.g., occupation=“ ”
• Noisy: containing errors or outliers
e.g., Salary=“-10”
• Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
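The three kinds of dirty data above can be flagged with simple checks. The sketch below is a minimal illustration, not a complete cleaning routine: the records, the fixed `REFERENCE_DATE`, and the helper names `age_on` and `find_issues` are assumptions made for the example, and the birthday “03/07/1997” is read here as 3 July 1997.

```python
from datetime import date

# Hypothetical records modelled on the examples above.
records = [
    {"occupation": "", "salary": 52000, "age": 30, "birthday": date(1994, 5, 2)},
    {"occupation": "engineer", "salary": -10, "age": 41, "birthday": date(1983, 1, 15)},
    {"occupation": "teacher", "salary": 38000, "age": 42, "birthday": date(1997, 7, 3)},
    {"occupation": "nurse", "salary": 45000, "age": 29, "birthday": date(1995, 3, 20)},
]

# Assumed "today", fixed so that the result is reproducible.
REFERENCE_DATE = date(2024, 6, 1)


def age_on(birthday, today):
    """Age in whole years on the given day."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_issues(record, today=REFERENCE_DATE):
    """Return the kinds of dirtiness found in one record."""
    issues = []
    if not str(record["occupation"]).strip():
        issues.append("incomplete")  # missing attribute value
    if record["salary"] < 0:
        issues.append("noisy")  # impossible value
    if age_on(record["birthday"], today) != record["age"]:
        issues.append("inconsistent")  # age disagrees with birthday
    return issues


report = [find_issues(r) for r in records]
print(report)
```

Each record receives a list of the problems found in it, so the last record, which passes all three checks, gets an empty list.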

3. What is data preprocessing?

Data preprocessing is the process of transforming raw data into an understandable
format. It is an important step in data mining because raw data cannot be used
directly. The quality of the data should be checked before applying machine learning
or data mining algorithms.

4. List the factors of data quality.

Data quality has five factors:

• Accuracy
• Completeness
• Reliability
• Relevance
• Timeliness

Characteristic    How it’s measured

Accuracy          Is the information correct in every detail?
Completeness      How comprehensive is the information?
Reliability       Does the information contradict other trusted resources?
Relevance         Do you really need this information?
Timeliness        How up-to-date is the information? Can it be used for real-time reporting?
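Two of these factors, completeness and timeliness, are easy to turn into numbers. The following sketch assumes a small made-up customer table, a fixed `today`, and a 90-day freshness threshold. None of these come from the text above, and real quality metrics are usually defined per project.

```python
from datetime import date

# Hypothetical customer rows; names and dates are invented for illustration.
customers = [
    {"name": "A", "email": "a@example.com", "updated": date(2024, 5, 20)},
    {"name": "B", "email": None, "updated": date(2024, 5, 30)},
    {"name": "C", "email": "c@example.com", "updated": date(2023, 1, 10)},
    {"name": "D", "email": "", "updated": date(2024, 4, 2)},
]


def completeness(rows, field):
    """Share of rows in which the field holds a non-empty value."""
    filled = sum(1 for row in rows if row[field] not in (None, ""))
    return filled / len(rows)


def timeliness(rows, field, today, max_age_days):
    """Share of rows updated within the last max_age_days days."""
    fresh = sum(1 for row in rows if (today - row[field]).days <= max_age_days)
    return fresh / len(rows)


today = date(2024, 6, 1)  # assumed reporting date
email_completeness = completeness(customers, "email")
fresh_share = timeliness(customers, "updated", today, max_age_days=90)

print(email_completeness, fresh_share)
```

Accuracy, reliability and relevance generally need an external reference or a human judgment, so they are harder to compute from the data alone.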

5. Explain the stages or steps of data preprocessing in detail.

Ans. The steps used in data preprocessing include the following:

1. Data profiling. This is the process of examining, analyzing and reviewing data to collect statistics
about its quality. It starts with a survey of existing data and its characteristics. Data scientists
identify data sets that are pertinent to the problem at hand, inventory their significant attributes,
and form a hypothesis about which features might be relevant for the proposed analytics or machine
learning task. They also relate data sources to the relevant business concepts and consider
which preprocessing libraries could be used.
2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable for
feature engineering.
3. Data reduction. Raw data sets often include redundant data that arises from characterizing
phenomena in different ways, or data that is not relevant to a particular ML, AI or analytics task.
Data reduction uses techniques like principal component analysis to transform the raw data into
a simpler form suitable for particular use cases.
4. Data transformation. Here, data scientists think about how different aspects of the data need to
be organized to make the most sense for the goal. This could include things like
structuring unstructured data, combining salient variables when it makes sense or identifying
important ranges to focus on.
5. Data enrichment. In this step, data scientists apply the various feature engineering libraries to
the data to effect the desired transformations. The result should be a data set organized to
achieve the optimal balance between the training time for a new model and the required
compute.
6. Data validation. At this stage, the data is split into two sets. The first set is used to train a
machine learning or deep learning model. The second set is the testing data, which is used to
gauge the accuracy and robustness of the resulting model. This step helps identify any
problems in the hypothesis used in the cleaning and feature engineering of the data. If the data
scientists are satisfied with the results, they can pass the preprocessing task to a data
engineer, who works out how to scale it for production. If not, the data scientists can go back
and change the way they implemented the data cleansing and feature engineering
steps.
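As a rough sketch of two of the steps above, the code below fills in a missing value (data cleansing) and then splits the data into a training set and a testing set (data validation). The toy rows, the choice of mean imputation, the 80/20 split and the helper names `fill_missing_with_mean` and `train_test_split` are all assumptions made for illustration. The steps above do not prescribe a particular method.

```python
import random
from statistics import mean

# Toy data set with one missing value; invented for illustration.
rows = [{"x": float(i), "y": 2.0 * i} for i in range(10)]
rows[3]["x"] = None


def fill_missing_with_mean(data, field):
    """Data cleansing: replace missing values with the mean of the known ones."""
    known = [r[field] for r in data if r[field] is not None]
    fill = mean(known)
    # Build new dicts so the raw data stays untouched.
    return [{**r, field: fill if r[field] is None else r[field]} for r in data]


def train_test_split(data, test_fraction=0.2, seed=0):
    """Data validation: shuffle reproducibly, then split into train and test."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


cleaned = fill_missing_with_mean(rows, "x")
train, test = train_test_split(cleaned)

print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which helps when the data scientists go back and compare different versions of the cleansing step against the same testing set.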

6. What are the tools used for data preprocessing?

Ans.

1. R - The R Project for Statistical Computing
2. Weka - Data mining with open source machine learning software in Java
3. RapidMiner
4. Trifacta Wrangler
5. Python

7. What are the ways to do data preprocessing?

Ans. In practice, data preprocessing is carried out in the following ways:

1. Data Quality Assessment.
2. Data Cleaning.
3. Data Transformation.
4. Data Reduction.
