
Unit-I Introduction to Data Science

What is Data Science


Data science is an interdisciplinary field that uses statistical and computational methods to extract insightful information and knowledge from data. Python, a popular and versatile programming language, has become a common choice among data scientists for its ease of use, extensive libraries, and flexibility. Python provides an efficient and streamlined approach to handling complex data structures and extracting insights.

How to Learn Data Science?


Usually, there are four areas to master in data science.
• Industry Knowledge: Domain knowledge of the area in which you are going to work is necessary. For example, if you want to be a data scientist in the blogging domain, you should know a good deal about the blogging sector, such as SEO, keywords, and serializing. It will be beneficial in your data science journey.
• Models and Logic Knowledge: All machine learning systems are built on models or algorithms, so it is an important prerequisite to have basic knowledge about the models that are used in data science.
• Computer and Programming Knowledge: Master-level programming knowledge is not required in data science, but you should know the basics: variables, constants, loops, conditional statements, input/output, and functions.
• Mathematics: This is an important part of data science. There is no single tutorial for it, but you should have knowledge of topics such as mean, median, mode, variance, percentiles, distributions, probability, Bayes' theorem, and statistical tests like hypothesis testing, ANOVA, chi-square, and p-values.

Introduction to Life Cycle of Data Science projects (Beginner Friendly)

As an aspiring data scientist, you must be keen to understand how the life cycle of data science projects works so that it is easier for you to implement your own projects in a similar pattern. Here, we discuss the step-by-step implementation process of a data science project in a real-world scenario.

What is a Data Science Project Lifecycle?

In simple terms, a data science life cycle is nothing but a repetitive set of steps that you need to take to complete and deliver a project/product to your client. Because the projects and the teams involved in developing and deploying the model differ, every data science life cycle will be slightly different from one company to another. However, most data science projects happen to follow a somewhat similar process.

In order to start and complete a data science-based project, we need to understand the various roles and responsibilities of the people involved in building and developing the project. Let us take a look at those employees who are involved in a typical data science project:

Who Are Involved in The Projects?


• Business Analyst
• Data Analyst
• Data Scientists
• Data Engineer
• Data Architect
• Machine Learning Engineer

Now that we have an idea of who is involved in a typical business project, let's understand what a data science project is and how we define the life cycle of a data science project in a real-world scenario, such as a fake news identifier.
Why do we need to define the Life Cycle of a data science project?

In a normal case, a Data Science project contains data as its main element. Without any data, we won't be able to do any analysis or predict any outcome, as we would be looking at something unknown. Hence, before starting any data science project that we have got from our clients or stakeholders, we first need to understand the underlying problem statement presented by them. Once we understand the business problem, we have to gather the relevant data that will help us in solving the use case. However, for beginners many questions arise, such as:
In what format do we need the data? How do we get the data? What do we need to do with the data?

So many questions, yet the answers might vary from person to person. Hence, in order to address all these concerns right away, we have a pre-defined flow that is termed the Data Science Project Life Cycle. The process is fairly simple: the company has to first gather data, perform data cleaning, perform EDA to extract relevant features, and prepare the data by performing feature engineering and feature scaling. In the second phase, the model is built and deployed after a proper evaluation. This entire lifecycle is not a one man's job; for this, you need the entire team to work together to get the work done while achieving the required amount of efficiency for the project.

The Data Science Lifecycle revolves around the use of machine learning and different analytical strategies to produce insights and predictions from information in order to achieve a business objective. The complete process includes a number of steps like data cleaning, preparation, modelling, model evaluation, etc. It is a lengthy procedure and may take quite a few months to complete. So, it is very essential to have a generic structure to follow for each and every problem at hand. The globally accepted structure for resolving any sort of analytical problem is popularly known as the Cross Industry Standard Process for Data Mining, abbreviated as the CRISP-DM framework.
Let us understand the need for Data Science.

Earlier, data used to be much smaller and generally available in a well-structured form that we could save effortlessly and easily in Excel sheets, and with the help of Business Intelligence tools it could be processed efficiently. But today we deal with enormous amounts of data: roughly 3.0 quintillion bytes of records are produced each and every day, which ultimately results in an explosion of records and data. According to recent research, it is estimated that about 1.9 MB of data and records are created every second by a single individual.
So it is a very big challenge for any organization to deal with such a massive amount of data generated every second. Handling and evaluating this data requires some very powerful, complex algorithms and technologies, and this is where Data Science comes into the picture.
The following are some primary motives for the use of Data Science technology:
• It helps to convert large quantities of raw and unstructured records into meaningful insights.
• It can assist in making predictions in various areas such as surveys, elections, etc.
• It also helps in automating transportation, such as building self-driving cars, which we can say are the future of transportation.
• Companies are shifting towards Data Science and opting for this technology. Amazon, Netflix, etc., which cope with huge quantities of data, are using data science algorithms to deliver a better customer experience.

Life Cycle of a Typical Data Science Project Explained:

• Understanding the Business Problem:

In order to build a successful business model, it's very important to first understand the business problem that the client is facing. Suppose he wants to predict the customer churn rate of his retail business. You may first want to understand his business, his requirements, and what he actually wants to achieve from the prediction. In such cases, it is important to consult domain experts and understand the underlying problems that are present in the system. A Business Analyst is generally responsible for gathering the required details from the client and forwarding them to the data scientist team for further analysis. Even a minute error in defining the problem and understanding the requirement may be very costly for the project, hence it is to be done with maximum precision.
After asking the required questions to the company stakeholders or clients, we move to the next process, which is known as data collection.

• Data Mining:
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.

Data mining techniques and tools enable enterprises to predict future trends (see https://www.techtarget.com/searchbusinessanalytics/feature/Top-5-predictive-analytics-use-cases-in-enterprises) and make more informed business decisions.
After gaining clarity on the problem statement, we need to collect relevant
data to break the problem into small components.

The data science project starts with the identification of various data sources. Data collection entails obtaining information from both known internal and external sources that can assist in addressing the business issue.
Normally, the data analyst team is responsible for gathering the data. They need to figure out proper ways to source the data and collect it to get the desired results.
There are two ways to source the data:
• Through web scraping with Python
• Extracting Data with the use of third party APIs
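To make these two options concrete, here is a minimal, illustrative Python sketch of both approaches; the URLs, HTML tags, and parameter names are placeholders, not real endpoints, so adapt them to your actual source.

# Illustrative only: the URLs and tags below are placeholders.
import requests
import pandas as pd
from bs4 import BeautifulSoup

# 1. Web scraping: download a page and pull text out of its HTML.
page = requests.get("https://example.com/products")
soup = BeautifulSoup(page.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# 2. Third-party API: many APIs return JSON, which maps neatly to a DataFrame.
response = requests.get("https://api.example.com/v1/sales", params={"year": 2023})
df = pd.DataFrame(response.json())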

• Data Cleaning:
This is a very important stage in the data science lifecycle. It includes data cleaning, data reduction, data transformation, and data integration. This stage takes a lot of time, and data scientists spend a significant amount of time preparing the data.
Data cleaning is handling the missing values in the data, filling in these missing values with appropriate values, and smoothing out the noisy data.
Data reduction is using various strategies to reduce the size of the data such that the output remains the same while the processing time of the data reduces.
Data transformation is transforming the data from one type to another so that it can be used efficiently for analysis and visualization.
Data integration is resolving any conflicts in the data and handling redundancies. Data preparation is the most time-consuming process, accounting for up to 90% of the total project duration, and it is the most crucial step throughout the entire life cycle.
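As a rough illustration of these cleaning steps, the following pandas sketch (on a made-up toy dataset) removes duplicates, imputes a missing value, and drops an obviously noisy record:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 32, 32, 400],           # toy data: one missing value, one outlier
    "salary": [30000, 42000, 51000, 51000, 48000],
})
df = df.drop_duplicates()                          # handle redundant records
df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
df = df[df["age"] < 120]                           # remove an obviously noisy value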
• Exploratory Data Analysis:
This step involves getting some idea about the solution and the factors affecting it before building the actual model. The distribution of data within the different variables is explored graphically using bar graphs, and relations between different features are captured through graphical representations like scatter plots and heat maps. Many data visualization techniques are used extensively to explore each and every feature individually and in combination with other features.
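A minimal EDA sketch along these lines, assuming a pandas DataFrame df with numeric columns such as age and salary (hypothetical names), might look like this:

import matplotlib.pyplot as plt
import seaborn as sns

df["age"].plot(kind="hist", bins=20, title="Distribution of age")   # one variable at a time
plt.show()
sns.scatterplot(data=df, x="age", y="salary")                       # relation between two features
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)                 # correlations as a heat map
plt.show()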

• Feature Engineering:
Feature engineering is a machine learning technique that leverages data to
create new variables that aren’t in the training set. It can produce new features
for both supervised and unsupervised learning, with the goal of simplifying and
speeding up data transformations while also enhancing model accuracy.
Feature engineering is required when working with machine learning models. Regardless of the data or architecture, a poorly constructed feature will have a direct negative impact on your model.
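For illustration, here is a small pandas sketch that engineers new variables from a hypothetical orders table; the column names are invented for the example:

import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-02-28"]),
    "quantity":   [2, 5, 1],
    "unit_price": [19.9, 4.5, 99.0],
})
orders["revenue"]     = orders["quantity"] * orders["unit_price"]  # combine existing columns
orders["order_month"] = orders["order_date"].dt.month              # decompose a date
orders["is_bulk"]     = (orders["quantity"] >= 5).astype(int)      # simple binary flag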

• Data Modeling/predictive Modelling

In most data analysis work, data modeling is regarded as the core process. In this process, we take the prepared data as input and use it to produce the desired output.
We first select the appropriate type of model to be implemented, depending on whether the problem is a regression, classification, or clustering problem. Depending on the type of data received, we choose the machine learning algorithm that is best suited for the model. Once this is done, we tune the hyperparameters of the chosen models to get a favorable outcome.
Finally, we evaluate the model by testing its accuracy and relevance. In addition, we need to make sure there is a correct balance between specificity and generalizability, i.e., the created model must be unbiased.
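A compact scikit-learn sketch of this phase, using a built-in dataset purely for illustration, shows model selection, hyperparameter tuning, and a final check on held-out data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Tune hyperparameters of the chosen model family on the training data only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
    cv=5,
)
search.fit(X_train, y_train)
# Evaluate on data the model has never seen.
print(accuracy_score(y_test, search.predict(X_test)))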

• Data visualization:

Data visualization is the graphical representation of information and data in a pictorial or graphical format (for example: charts, graphs, and maps). Data visualization tools provide an accessible way to see and understand trends, patterns in data, and outliers. Data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions. The concept of using pictures to understand data has been used for centuries. General types of data visualization are charts, tables, graphs, maps, and dashboards.

Stages of Data Science Project


• Defining a problem:
The first stage of any Data Science project is to identify and define a problem to be solved. Without a clearly defined problem, it can be difficult to know how to tackle it; for a Data Science project this includes deciding which method to use, such as classification, regression, or clustering. Also, without a clearly defined problem, it can be hard to determine what your measure of success should be. Without a defined measure of success, you can never know when your project is complete or good enough to be used in production. This can lead to wasted resources and time, such as when a simple logistic regression would have reached the target rather than a complicated classification neural network.

• Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or a team of data scientists, it is important for data processing to be done correctly so as not to negatively affect the end product, or data output.
• The second stage is data processing.
• Its purpose is to prepare the data so that we can understand the business problem and extract the information needed to solve it.
• Steps for preparing the data:
• Select the data related to the problem.
• Combine the data sets; where needed, integrate the data.
• Clean the data to find the missing values.
• Handle the missing values by removing or imputing them.
• Deal with errors by removing them.
• Use box plots for detecting outliers and handling them (see the sketch below).
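Here is a minimal sketch of the box-plot (IQR) rule referred to in the last step, using a small made-up series:

import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 14, 95])             # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)   # the box-plot whisker rule
print(s[outliers])                                       # flagged values
s_clean = s[~outliers]                                   # one way to handle them: drop the rows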
EDA (Exploratory Data Analysis):
Before actually developing the model, this step entails understanding the solution and the variables that may affect it, in order to understand the data and its features better. Around 70% of the data science project life is spent on this step, and we can extract a lot of information with proper EDA.

• Data Modelling:
• This is the most important step of the data science project.
• This phase is about selecting the right model type, depending on whether the issue is classification, regression, or clustering. Following the selection of the model family, we must carefully select and implement the algorithms to be used within that family.

• Data evaluation / model evaluation:


• We evaluate the model to understand how well it works. There are two techniques used widely to assess the performance of a model:
• hold-out evaluation and cross-validation.
• Hold-out evaluation:
• It is the process of testing a model with data that is distinct from the data it was trained on. This offers an unbiased assessment of learning effectiveness.
• Cross-validation:
• This is the process of splitting the data into sets and using them to analyze the performance of the model. The initial observation data set is divided into two sets:
• a training set for the model's training and an independent set for the evaluation.
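The following scikit-learn sketch contrasts the two techniques on a built-in toy dataset (used here only for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# Hold-out: keep an independent test set aside and score on it once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)
# Cross-validation: split into several folds and average the fold scores.
cv_scores = cross_val_score(model, X, y, cv=5)
print(holdout_score, cv_scores.mean())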

• Model Deployment
• This is the final stage of a data science project.
• In this stage, the delivery method that will be used to distribute the model to users or another system is created.
• For different projects, this step can mean many different things: getting your model results into a Tableau/Power BI dashboard might be all that is necessary,
• or it can be as complicated as scaling it to millions of users on the cloud.
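As one hedged example of the simpler end of deployment, the sketch below serves a previously trained model over HTTP with Flask; the file name model.joblib and the request format are assumptions made for the example, not part of any particular project:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")        # hypothetical file saved after training

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)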

• Job Roles in Data Science

• Data Analyst

Data analysts are responsible for a variety of tasks including visualization, managing, and processing of massive amounts of data. They also have to perform queries on the databases from time to time.
One of the most important skills of a data analyst is optimization. This is because they have to create and modify algorithms that can be used to pull information from some of the biggest databases without corrupting the data.

Few Important Roles and Responsibilities of a Data Analyst include:

• Extracting data from primary and secondary sources using automated tools
• Developing and maintaining databases
• Performing data analysis and making reports with recommendations
• Analyzing data and forecasting trends that impact the organization/project
• Working with other team members to improve data collection and quality processes.
SQL, R, SAS, and Python are some of the sought-after technologies for data
analysis. So, certification in these can easily give a boost to your job
applications. You should also have good problem-solving qualities.

• Data Engineers

Data engineers build and test scalable Big Data ecosystems for the business so that the data scientists can run their algorithms on data systems that are stable and highly optimized. Data engineers also update the existing systems with newer or upgraded versions of the current technologies to improve the efficiency of the databases.

Few Important Roles and Responsibilities of a Data Engineer include:

• Design and maintain data management systems


• Data collection/acquisition and management
• Conducting primary and secondary research
• Finding hidden patterns and forecasting trends using data
• Collaborating with other teams to perceive organizational goals
• Make reports and update stakeholders based on
analytics
If you are interested in a career as a data engineer, then technologies that require hands-on experience include Hive, NoSQL, R, Ruby, Java, C++, and Matlab. It would also help if you can work with popular data APIs and ETL tools, etc.

• Database Administrator

The job profile of a database administrator is pretty much self-explanatory: they are responsible for the proper functioning of all the databases of an enterprise and grant or revoke its services to the employees of the company depending on their requirements. They are also responsible for database backups and recoveries.

Few Important Roles and Responsibilities of a Database Administrator include:


• Working on database software to store and manage data
• Working on database design and development
• Implementing security measures for database
• Preparing reports, documentation, and operating manuals
• Data archiving
• Working closely with programmers, project managers, and other
team members.
Some of the essential skills and talents of a database administrator include database backup and recovery, data security, data modelling and design, etc. If you are good at disaster management, it's certainly a bonus.

• Machine Learning Engineer


Machine learning engineers are in high demand today. However, the job profile comes with its challenges. Apart from having in-depth knowledge of some of the most powerful technologies such as SQL, REST APIs, etc., machine learning engineers are also expected to perform A/B testing, build data pipelines, and implement common machine learning models such as classification, clustering, etc.
Few Important Roles and Responsibilities of a Machine Learning Engineer include:
• Designing and developing Machine Learning systems
• Researching Machine Learning Algorithms
• Testing Machine Learning systems
• Developing apps/products basis client requirements
• Extending existing Machine Learning frameworks and libraries
• Exploring and visualizing data for a better understanding
• Training and retraining systems
Firstly, you must have sound knowledge of technologies like Java, Python, JS, etc. Secondly, you should have a strong grasp of statistics and mathematics. Once you have mastered both, it's a lot easier to break into the role.
• Data Scientist
Data scientists have to understand the challenges of business and offer the best solutions using data analysis and data processing. For instance, they are expected to perform predictive analysis and run a fine-toothed comb through unstructured/disorganized data to offer actionable insights. They can also do this by identifying trends and patterns that can help the companies in making better decisions.

Few Important Roles and Responsibilities of a Data Scientist include:


• Identifying data collection sources for business needs
• Processing, cleansing, and integrating data
• Automating data collection and management processes
• Using Data Science techniques/tools to improve processes
• Analyzing large amounts of data to forecast trends and provide reports with recommendations
• Collaborating with business, engineering, and product teams
How to Become a Data Scientist?
To become a data scientist you have to be an expert in R, MatLab, SQL, Python, and other complementary technologies. It can also help if you have a higher degree in mathematics or computer engineering, etc.

• Data Architect
A data architect creates the blueprints for data management so that the
databases can be easily integrated, centralized, and protected with the best
security measures. They also ensure that the data engineers have the best tools
and systems to work with.

Few Important Roles and Responsibilities of a Data Architect include:


• Developing and implementing an overall data strategy in line with the needs of the business/organization
• Identifying data collection sources in line with data strategy
• Collaborating with cross-functional teams and stakeholders for
smooth functioning of database systems
• Planning and managing end-to-end data architecture
• Maintaining database systems/architecture considering efficiency and security
• Regular auditing of data management system performance and making changes to improve systems accordingly.
How to Become a Data Architect?
A career in data architecture requires expertise in data warehousing, data modelling, extraction, transformation, and loading (ETL), etc. You must also be well versed in Hive, Pig, Spark, etc.

• Statistician
A statistician, as the name suggests, has a sound understanding of statistical theories and data organization. Not only do they extract and offer valuable insights from the data clusters, but they also help create new methodologies for the engineers to apply.

Few Important Roles and Responsibilities of a Statistician include:


• Collecting, analyzing, and interpreting data
• Analyzing data, assessing results, and predicting trends/relationships using statistical methodologies/tools
• Designing data collection processes
• Communicating findings to stakeholders
• Advising/consulting on organizational and business strategy basis data
• Coordinating with cross-functional teams
How to Become a Statistician?
A statistician has to have a passion for logic. They should also be comfortable with a variety of database systems such as SQL, with data mining, and with the various machine learning technologies.

• Business Analyst
The role of a business analyst is slightly different from other data science jobs. While they do have a good understanding of how data-oriented technologies work and how to handle large volumes of data, they also separate the high-value data from the low-value data. In other words, they identify how Big Data can be linked to actionable business insights for business growth.

Few Important Roles and Responsibilities of a Business Analyst include:


• Understanding the business of the organization
• Conducting detailed business analysis – outlining problems, opportunities, and solutions
• Working on improving existing business processes
• Analysing, designing, and implementing new technology and systems
• Budgeting and forecasting
• Pricing analysis
Business analysts act as a link between the data engineers and the
management executives. So, they should have an understanding of business
finances and business intelligence, and also the IT technologies like data
modelling, data visualization tools, etc.

• Data and Analytics Manager


A data and analytics manager oversees the data science operations and assigns duties to their team according to skills and expertise. Their strengths should include technologies like SAS, R, SQL, etc.
Few Important Roles and Responsibilities of a Data and Analytics Manager include:
• Developing data analysis strategies
• Researching and implementing analytics solutions
• Leading and managing a team of data analysts
• Overseeing all data analytics operations to ensure quality
• Building systems and processes to transform raw data into actionable business insights
• Staying up to date on industry news and trends

Applications of Data Science


• In Search Engines
• In Transport
• In Finance
• In E-Commerce
• In Health Care
• Image Recognition
• Targeting Recommendation
• Airline Route Planning
• Data Science in Gaming
• In Delivery Logistics
• Autocomplete
• In Search Engines
The most useful application of Data Science is in search engines. As we know, when we want to search for something on the internet, we mostly use search engines like Google, Yahoo, Bing, DuckDuckGo, etc. Data Science is used to make these searches faster and more relevant.
For example, when we search for something like "Data Structure and Algorithm courses", the first link we often get is for GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited most often for information regarding Data Structure courses and computer-related subjects. This analysis is done using Data Science, and the most-visited web links are shown at the top.
• In Transport
Data Science has also entered the transport field, for example with driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed to determine things like the speed limit on highways, busy streets, narrow roads, etc., and how to handle different situations while driving.
• In Finance
Data Science plays a key role in the financial industries. Financial industries always have issues of fraud and risk of losses. Thus, they need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. Financial industries also use Data Science analytics tools to predict the future, which allows companies to predict customer lifetime value and their stock market moves.
For example, in the stock market, Data Science plays a central role. There, Data Science is used to examine past behavior with past data, with the goal of predicting future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set timeframe.
• In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a better user experience with personalized recommendations.
For example, when we search for something on e-commerce websites, we get suggestions similar to our choices based on our past data, and we also get recommendations based on the most-bought, most-rated, and most-searched products, etc. This is all done with the help of Data Science.

• In Health Care
In the healthcare industry, data science acts as a boon. Data Science is used for:
• Detecting Tumor.
• Drug discoveries.
• Medical Image Analysis.
• Virtual Medical Bots.
• Genetics and Genomics.
• Predictive Modeling for Diagnosis etc.
• Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a picture with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the help of machine learning and Data Science. When an image is recognized, data analysis is done on one's Facebook friends, and if a face present in the picture matches someone's profile, Facebook suggests auto-tagging.
• Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever the user searches for on the internet, he/she will then see related posts everywhere. This can be explained with an example: suppose I want a mobile phone, so I search for it on Google and then change my mind and decide to buy it offline. Data Science helps the companies that are paying for advertisements for that mobile phone, so everywhere on the internet (on social media, on websites, in apps) I will see recommendations for the phone I searched for, which nudges me to buy it online.
• Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it becomes easier to predict flight delays. It also helps in deciding whether to fly directly to the destination or take a halt in between: a flight can fly directly from Delhi to the U.S.A., or it can halt in between before reaching the destination.
• Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used along with machine learning so that, with the help of past data, the computer can improve its performance. Many games, such as chess and EA Sports titles, use Data Science concepts.

• In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data
Science. Data Science helps these companies to find the best route for the
Shipment of their Products, the best time suited for delivery, the best mode of
transport to reach the destination, etc.
• Autocomplete
The autocomplete feature is an important application of Data Science: the user only has to type a few letters or words, and the rest of the line is completed automatically. In Google Mail, for instance, when we are writing a formal mail to someone, the data science concept behind the autocomplete feature offers efficient choices to complete the whole line. Autocomplete is also widely used in search engines, in social media, and in various apps.

• Speech Recognition
Some of the best examples of speech recognition products are Google
Voice, Siri, Cortana etc. Using the speech-recognition feature, even if you
aren’t in a position to type a message, your life wouldn’t stop. Simply speak
out the message and it will be converted to text. However, at times, you would
realize, speech recognition doesn’t perform accurately.

Data collection Strategies

What is Data Collection? | Methods of Collecting Data


What is Data Collection?
Data Collection is the systematic process of gathering, measuring, and recording data for research, analysis, or decision-making. It involves collecting data from various sources, such as surveys, interviews, observations, experiments, documents, or existing databases, to obtain relevant and reliable information.
Data Collection is the process of collecting information from relevant sources in order to find a solution to the given statistical enquiry. Collection of data is the first and foremost step in a statistical investigation.
Here, a statistical enquiry means an investigation made by any agency on a topic in which the investigator collects the relevant quantitative information. In simple terms, a statistical enquiry is the search for truth using statistical methods of collection, compiling, analysis, interpretation, etc.
The basic problem for any statistical enquiry is the collection of facts and figures related to the specific phenomenon being studied. Therefore, the basic purpose of data collection is to collect evidence to reach a sound and clear solution to a problem.
Data collection is a process of measuring and gathering information on the desired variables in such a fashion that questions related to the data can be answered and used in research of various types. Data collection is a common feature of study in various disciplines, such as marketing, statistics, economics, the sciences, etc. The methods of collecting data may vary according to the subject, but the ultimate aim of the study and honesty in data collection are of the same importance in all fields of study.

Types of Data Collection


Depending on the nature of data collection, it can be divided into two major types, namely:
• Primary data collection method
• Secondary data collection method.
Important Terms related to Data Collection:
• Investigator: An investigator is a person who conducts the statistical enquiry.
• Enumerators: In order to collect information for a statistical enquiry, an investigator needs the help of some people. These people are known as enumerators.
• Respondents: A respondent is a person from whom the statistical information required for the enquiry is collected.
• Survey: It is a method of collecting information from individuals. The basic purpose of a survey is to collect data to describe different characteristics such as usefulness, quality, price, kindness, etc. It involves asking questions about a product or service from a large number of people.
Thus, data is a tool that helps an investigator in understanding the problem by providing him with the information required. Data can be classified into two types: Primary Data and Secondary Data. Primary Data is the data collected by the investigator from primary sources for the first time, from scratch. Secondary Data, however, is data already in existence that has been previously collected by someone else for other purposes. It does not include any real-time data, as the research has already been done on that information.
Methods of Collecting Data
There are two different methods of collecting data: Primary Data Collection
and Secondary Data Collection.

• Methods of Collecting Primary Data:


Primary Data Collection: Quantitative Data Collection
There are a number of methods of collecting primary data. Some of the common methods are as follows:
• Direct Personal Investigation: As the name suggests, the method of direct
personal investigation involves collecting data personally from the source of origin. In
simple words, the investigator makes direct contact with the person from whom he/she
wants to obtain information. This method can attain success only when the investigator
collecting data is efficient, diligent, tolerant and impartial. For example, direct contact
with the household women to obtain information about their daily routine and schedule.
• Indirect Oral Investigation: In this method of collecting primary data, the investigator does not make direct contact with the person from whom he/she needs information; instead, the data is collected orally from some other person who has the required information. For example, collecting data about employees from their superiors or managers.
• Information from Local Sources or Correspondents: In this method, the investigator appoints correspondents or local persons at various places, who collect the data and furnish it to the investigator. With the help of correspondents and local persons, the investigator can cover a wide area.
• Information through Questionnaires and Schedules: In this method of collecting primary data, the investigator, keeping in mind the motive of the study, prepares a questionnaire. The investigator can collect data through the questionnaire in two ways:
• Mailing Method: This method involves mailing the questionnaires to the informants for the collection of data. The investigator attaches a letter with the questionnaire in the mail to explain the purpose of the study or research. The investigator also assures the informants that their information will be kept secret; the informants then note their answers to the questionnaire and return the completed form.
• Enumerator’s Method: This method involves the preparation of a
questionnaire according to the purpose of the study or research. However, in this case,
the enumerator reaches out to the informants himself with the prepared questionnaire.
Enumerators are not the investigators themselves; they are the people who help the
investigator in the collection of data.
Primary data is collected by researchers on their own and for the first time
in a study. There are various ways of collecting primary data, some of
which are the following:
• Interview: Interviews are the most used primary data collection
method. In interviews a questionnaire is used to collect data or the
researcher may ask questions directly to the interviewee. The idea is to
seek information on concerning topics from the answers of the
respondent. Questionnaires used can be sent via email or details can be
asked over telephonic interviews.
• Delphi Technique: In this method, the researcher asks for information
from the panel of experts. The researcher may choose in-person
research or questionnaires may be sent via email. At the end of the
Delphi technique, all data is collected according to the need of the
research.
• Projective techniques: Projective techniques are used in research that
is private or confidential in a manner where the researcher thinks that
respondents won’t reveal information if direct questions are asked.
There are many types of projective techniques, such as the Thematic Apperception Test (TAT), role-playing, cartoon completion, word association, and sentence completion.
• Focus Group Interview: Here, a few people gather to discuss the problem at hand. The number of participants is usually between six and twelve in such interviews. Every participant expresses his or her own insights, and a collective, unanimous decision is reached.
• Questionnaire Method: Here, a questionnaire is used for collecting data from a diverse group of people. A set of questions is used for the research concerned, and respondents answer queries related to the questionnaire directly or indirectly. The questions can be either open-ended or closed-ended.

• Methods of Collecting Secondary Data or Qualitative Data


Secondary data can be collected through different published and unpublished
sources. Some of them are as follows:
• Government Publications: The government publishes different documents consisting of a variety of information or data compiled by the Ministries and the Central and State Governments in India as part of their routine activity. As the government publishes these statistics, they are fairly reliable for the investigator. Examples of government publications on statistics are the Annual Survey of Industries, the Statistical Abstract of India, etc.
• Semi-Government Publications: Different semi-government bodies also publish data related to health, education, deaths, and births. These kinds of data are also reliable and used by different informants. Some examples of semi-government bodies are Metropolitan Councils, Municipalities, etc.
• Publications of Trade Associations: Various big trade associations collect
and publish data from their research and statistical divisions of different trading
activities and their aspects. For example, data published by Sugar Mills Association
regarding different sugar mills in India.
• Journals and Papers: Different newspapers and magazines provide a variety
of statistical data in their writings, which are used by different investigators for their
studies.
• International Publications: Different international organizations like IMF,
UNO, ILO, World Bank, etc., publish a variety of statistical information which are
used as secondary data.
• Publications of Research Institutions: Research institutions and universities also publish their research activities and findings, which are used by different investigators as secondary data. For example, the National Council of Applied Economics, the Indian Statistical Institute, etc.
• Unpublished Sources
Another source of secondary data is unpublished sources. The data in unpublished sources is collected by different government organizations and other organizations. These organizations usually collect data for their own use, and it is not published anywhere. Examples include research work done by professors, professionals, and teachers, and records maintained by business and private enterprises.
Primary Data Collection Methods
Primary data, or raw data, is a type of information that is obtained directly from the first-hand source through experiments, surveys, or observations. The primary data collection method is further classified into two types:
• Quantitative Data Collection Methods
• Qualitative Data Collection Methods
Let us discuss the different methods performed to collect the data under these two data collection methods.
Quantitative Data Collection Methods
It is based on mathematical calculations using various formats like close-ended
questions, correlation and regression methods, mean, median or mode
measures. This method is cheaper than qualitative data collection methods and
it can be applied in a short duration of time.

Qualitative Data Collection Methods


It does not involve any mathematical calculations. This method is closely associated with elements that are not quantifiable. This qualitative data collection method includes interviews, questionnaires, observations, case studies, etc. There are several methods to collect this type of data. They are:

Observation Method

Observation method is used when the study relates to behavioural science.


This method is planned systematically. It is subject to many controls and
checks. The different types of observations are:
• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation
Interview Method

This is the method of collecting data in terms of verbal responses. It is achieved in two ways:
• Personal Interview – In this method, a person known as an interviewer asks questions face to face to the other person. The personal interview can be structured or unstructured: direct investigation, focused conversation, etc.
• Telephonic Interview – In this method, an interviewer obtains information by contacting people on the telephone to ask the questions or their views verbally.
Questionnaire Method
In this method, a set of questions is mailed to the respondents. They should read, reply, and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good survey should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have a good physical appearance, such as colour and quality of paper, to attract the attention of the respondent
Schedules

This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objectives of the investigation and may clear up misunderstandings, if any come up. Enumerators should be trained to perform their job with diligence and patience.

Secondary Data Collection Methods


Secondary data is data collected by someone other than the actual user. It means that the information is already available, and someone analyses it. Secondary data includes magazines, newspapers, books, journals, etc. It may be either published data or unpublished data.
Published data are available in various resources including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies, etc.

Whether you’re collecting data for business or academic research, the first
step is to identify the type of data you need to collect and what method you’ll
use to do so. In general, there are two data types— primary and secondary
—and you can gather both with a variety of effective collection methods.

Primary data refers to original, firsthand information, while secondary data


refers to information retrieved from already existing sources. Peter Drow, head of marketing at NCCuttingTools, explains that "original findings are primary data, whereas secondary data refers to information that has already been reported in secondary sources, such as books, newspapers, periodicals, magazines, web portals, etc."

Both primary and secondary data-collection methods have their pros, cons, and particular use cases. Read on for an explanation of your options and a list of some of the best methods to consider.

Primary data-collection methods

As mentioned above, primary data collection involves gathering original, firsthand information. Primary data-collection methods help researchers or service providers obtain specific and up-to-date information about their research subjects. These methods involve reaching out to a targeted group of people and sourcing data from them through surveys, interviews, observations, experiments, etc.

You can collect primary data using quantitative or qualitative methods. Let's take a closer look at the two:

Quantitative data-collection methods involve collecting information that you can analyze numerically. Closed-ended surveys and questionnaires with predefined options are usually the ways researchers collect quantitative information. They can then analyze the results using mathematical calculations such as means, modes, and grouped frequencies. An example is a simple poll: it's easy to quickly determine or express the number of participants who choose a specific option as a percentage of the whole.
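For instance, a minimal pandas sketch of such a poll summary (with invented answers) is:

import pandas as pd

answers = pd.Series(["Yes", "No", "Yes", "Yes", "Undecided", "No", "Yes"])
print(answers.value_counts(normalize=True) * 100)   # share of respondents per option, in percent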

Qualitative data collection involves retrieving nonmathematical data from primary sources. Unlike quantitative data-collection methods, where subjects are limited to predefined options, qualitative data-collection methods give subjects a chance to freely express their thoughts about the research topic. As a result, the data researchers collect via these methods is unstructured and often non-quantifiable.
Here's an important difference between the two: while quantitative methods focus on understanding "what," "who," or "how much," qualitative methods focus on understanding "why" and "how." For example, quantitative research on parents may show trends that are specific to fathers or mothers, but it may not uncover why those trends exist.
Drow explains that applying quantitative methods is faster and cheaper than applying qualitative methods. "It is simple to compare results because quantitative approaches are highly standardized. In contrast, qualitative research techniques rely on words, sounds, feelings, emotions, colors, and other intangible components."
Drow emphasizes that the field of your study and the goals and objectives of
your research will influence your decision about whether to use quantitative or
qualitative methodologies for data collection.
Below are some examples of primary data-collection methods:

• Questionnaires and surveys

While researchers often use the terms "survey" and "questionnaire" interchangeably, the two mean slightly different things.

A questionnaire refers specifically to the set of questions researchers use to collect information from respondents. It may include closed-ended questions, which means respondents are limited to predefined answers, or open-ended questions, which allow respondents to give their own answers.
A survey includes the entire process of creating questionnaires, collecting
responses, and analyzing the results.
You can also analyze survey results in easy-to-read spreadsheets, charts, and
more.

• Interviews

An interview is a conversation in which one participant asks questions and the other provides answers. Interviews work best for small groups and help you understand the opinions and feelings of respondents.

Interviews may be structured or unstructured. Structured interviews are similar to questionnaires and involve asking predetermined questions with specific multiple-choice answers. Unstructured interviews, on the other hand, give subjects the freedom to provide their own answers. You can conduct interviews in person or via recorded video or audio conferencing.

• Focus groups

A focus group is a small group of people who have an informal discussion about a particular topic, product, or idea. The researcher selects participants with similar interests, gives them topics to discuss, and records what they say.
Focus groups can help you better understand the results of a large-group quantitative study. For example, a survey of 1,000 respondents may help you spot trends and patterns, but a focus group of 10 respondents will provide additional context for the results of the large-group survey.

• Observation

Observation involves watching participants or their interactions with specific


products or objects. It’s a great way to collect data from a group when they’re
unwilling or unable to participate in interviews — children are a good
example.
You can conduct observations covertly or overtly. The former involves discreetly observing people's behavior without their knowledge, which allows you to see them acting naturally. Overt observation, on the other hand, has to be conducted openly, and it may cause the subjects to behave unnaturally.

Advantages of primary data-collection methods

• Accuracy: You collect data firsthand from the target demographic, which leaves less room for error or misreporting.
• Recency: Sourcing primary data ensures you have the most up-to-date information about the research subject.
• Control: You have full control over the data-collection process and can make adjustments where necessary to improve the quality of the data you collect.
• Relevance: You can ask specific questions that are directly relevant to your research.
• Privacy: You can control access to the research results and maintain confidentiality.

Disadvantages of primary data collection

• Cost: Collecting primary data can be expensive, especially if you're working with a large group.
• Labor: Collecting raw data can be labor intensive. When you're gathering data from large groups, you need more skilled hands. And if you're researching something arcane or unusual, it might be difficult to find people with the appropriate expertise.
• Time: Collecting primary data takes time. If you're conducting surveys, for example, participants have to fill out questionnaires. This could take anywhere from a few days to several months, depending on the size of the study group, how you deliver the survey, and how quickly participants respond. Post-survey activities, such as organizing and cleaning data to make it usable, also add up.

Secondary data-collection methods

Secondary data collection involves retrieving already available data from sources other than the target audience. When working with secondary data, the researcher doesn't "collect" data; instead, they consult secondary data sources.
Secondary data sources are broadly categorized into published and unpublished data. As the names suggest, published data has been published and released for public or private use, while unpublished data comprises unreleased private information that researchers or individuals have documented.
When choosing public data sources, Drow strongly recommends considering the date of publication, the author's credentials, the source's dependability, the text's level of discussion and depth of analysis, and the impact it has had on the growth of the field of study.

Below are some examples of secondary data sources:


• Online journals, records, and publications:
Data that reputable organizations have collected from research is usually published online. Many of these sources are freely accessible and serve as reliable data sources. But it's best to search for the latest editions of these publications because dated ones may provide invalid data.

• Government records and publications:


Periodically, government institutions collect data from people. The information can range from population figures to organizational records and other statistical information such as age distribution. You can usually find information like this in government libraries and use it for research purposes.

• Business and industry records


Industries and trade organizations usually release revenue figures and periodic industry trends in quarterly or biannual publications. These records serve as viable secondary data sources since they're industry-specific. Previous business records, such as companies' sales and revenue figures, can also be useful for research. While some of this information is available to the public, you may have to get permission to access other records.

• Newspapers
Newspapers often publish data they've collected from their own surveys. Due to the volume of resources you'll have to sift through, some surveys may be relevant to your niche but difficult to find on paper. Luckily, most newspapers are also published online, so looking through their online archives for specific data may be easier.

• Unpublished sources

These include diaries, letters, reports, records, and figures belonging to private individuals; these sources aren't in the public domain. Since authoritative bodies haven't vetted or published the data, it can often be unreliable.

Advantages of secondary data-collection methods

Below are some of the benefits of secondary data-collection methods and their
advantages over primary methods.

• Speed: Secondary data-collection methods are efficient because delayed responses and data documentation don't factor into the process. Using secondary data, analysts can go straight into data analysis.
• Low cost: Using secondary data is easier on the budget when compared to primary data collection. Secondary data often allows you to avoid logistics and other survey expenses.
• Volume: There are thousands of published resources available for data analysis. You can sift through the data that several individual research efforts have produced to find the components that are most relevant to your needs.
• Ease of use: Secondary data, especially data that organizations and the government have published, is usually clean and organized. This makes it easy to understand and extract.
• Ease of access: It's generally easier to source secondary data than primary data. A basic internet search can return relevant information at little or no cost.

Disadvantages of secondary data collection

• Lack of control: Using secondary data means you have no control over the survey process. Already published data may not include the questions you need answers to. This makes it difficult to find the exact data you need.
• Lack of specificity: There may not be many available reports for new industries, and government publications often have the same problem. Furthermore, if there's no available data for the niche your service specializes in, you'll encounter problems using secondary data.
• Lack of uniqueness: Using secondary sources may not give you the originality and uniqueness you need from data. For instance, if your service or product hinges on innovation and uses an out-of-the-norm approach to problem-solving, you may be disappointed by the generic nature of the data you collect.
• Age: Because user preferences change over time, data can evolve, and the secondary data you retrieve can become invalid. When this happens, it becomes difficult to source new data without conducting a hands-on survey.

What are Statistical Errors?


The errors that occur while collecting data are known as Statistical Errors. They depend on the sample size selected for the study. There are two types of Statistical Errors: Sampling Errors and Non-Sampling Errors.
• Sampling Errors:
The errors related to the nature or size of the sample selected for the study are known as Sampling Errors. If the size of the sample selected is very small or the nature of the sample is non-representative, then the estimated value may differ from the actual value of a parameter. This kind of error is a sampling error. For example, if the estimated value of a parameter is 10 while the actual value is 30, then the sampling error will be 10 - 30 = -20.
Sampling Error = Estimated Value – Actual Value
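A tiny NumPy sketch (with simulated numbers) makes the idea concrete: a small sample's mean is the estimated value, the population mean is the actual value, and their difference is the sampling error.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)   # the full population
sample = rng.choice(population, size=25)                  # a small, possibly unrepresentative sample
sampling_error = sample.mean() - population.mean()        # Estimated Value - Actual Value
print(round(sampling_error, 2))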

• Non-Sampling Errors:
The errors related to the collection of data are known as Non-Sampling Errors. The
different types of Non-Sampling Errors are Error of Measurement, Error of Non-
response,Error of Misinterpretation, Error of Calculation or Arithmetical Error, and
Error of Sampling Bias.
• Error of Measurement:
The reason behind the occurrence of Error of Measurement may be
difference in the scaleof measurement and difference in the rounding-off
procedure that is adopted by different investigators.
• Error of Non-response:
These errors arise when the respondents do not offer the information required for the
study.
• Error of Misinterpretation:
These errors arise when the respondents fail to correctly interpret the questions
given in the questionnaire.
• Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
• Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target population
cannot be included in the sample.
Note: If the field of investigation is larger or the size of the population is larger,
then the possibility of the occurrence of errors related to the collection of data
is high. Besides, a non-sampling error is more serious than a sampling error,
because one can minimize the sampling error by opting for a larger sample size,
which is not possible in the case of non-sampling errors.

Data Pre-processing Overview

Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to
the cleaning, transforming, and integrating of data in order to make it ready for
analysis. The goal of data preprocessing is to improve the quality of the data
and to make it more suitable for the specific data mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation, removal,
and transformation.

Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to convert
continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.

Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization can
be achieved through techniques such as equal-width binning, equal-
frequency binning, and clustering.

Data Normalization: This involves scaling the data to a common range,
such as between 0 and 1 or -1 and 1. Normalization is often used to handle
data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
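As a quick, hedged illustration of the first two techniques named above, the sketch below applies min-max normalization and z-score standardization to a single made-up column using pandas; the column name and values are invented for the example and are not taken from the text.

import pandas as pd

# Made-up data purely for illustration.
df = pd.DataFrame({"income": [20_000, 35_000, 50_000, 120_000, 75_000]})

# Min-max normalization: rescale values into the range [0, 1].
df["income_minmax"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Z-score normalization (standardization): zero mean, unit variance.
df["income_zscore"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)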
Data preprocessing plays a crucial role in ensuring the quality of data and the
accuracy of the analysis results. The specific steps involved in data
preprocessing may vary depending on the nature of the data and the analysis
goals.
By performing these steps, the data mining process becomes more efficient
and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the
raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


• Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part,
data cleaning is done. It involves handling of missing data, noisy data, etc.

• (a) Missing Data:
This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

• Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing
values manually, by the attribute mean, or by the most probable value
(a short sketch of mean imputation follows this data-cleaning list).

• (b) Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It
can be generated due to faulty data collection, data entry errors, etc. It can
be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data
is divided into segments of equal size, and then various methods are
performed to complete the task. Each segment is handled separately.
One can replace all data in a segment by its mean, or boundary values
can be used to complete the task.

• Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).

• Clustering:
This approach groups similar data into clusters. Outliers may then go
undetected, or they will fall outside the clusters.
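A minimal sketch of two of the cleaning steps just described, filling missing values with the attribute mean and smoothing sorted values by bin means, is shown below. It assumes pandas and NumPy are available; the column name and values are invented for illustration only.

import numpy as np
import pandas as pd

# Made-up column with missing entries.
df = pd.DataFrame({"age": [23, np.nan, 31, 27, np.nan, 45, 52, 38]})

# (a) Missing data: fill NaNs with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# (b) Noisy data, binning method: sort, split into equal-sized segments,
# and replace every value in a segment by the segment mean.
sorted_vals = np.sort(df["age"].to_numpy())
segments = np.array_split(sorted_vals, 4)          # four equal-sized segments
smoothed = np.concatenate([np.full(len(s), s.mean()) for s in segments])
print(smoothed)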
• Data Transformation:
This step is taken in order to transform the data into forms appropriate
for the mining process. This involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).

• Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

• Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.

• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute "city" can be converted to
"country".

• Data Reduction:
Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial
in situations where the dataset is too large to be processed efficiently, or where
the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used
in data mining, including:

• Data Sampling: This technique involves selecting a subset of the data
to work with, rather than using the entire dataset. This can be useful for reducing
the size of a dataset while still preserving the overall trends and patterns in the
data.
• Dimensionality Reduction: This technique involves reducing the number
of features in the dataset, either by removing features that are not relevant or by
combining multiple features into a single feature.
• Data Compression: This technique involves using techniques such as lossy
or lossless compression to reduce the size of a dataset.
• Data Discretization: This technique involves converting continuous
data into discrete data by partitioning the range of possible values into intervals
or bins.
• Feature Selection: This technique involves selecting a subset of
features from the dataset that are most relevant to the task at hand.
It's important to note that data reduction involves a trade-off between
accuracy and the size of the data: the more the data is reduced, the less
accurate and the less generalizable the resulting model may be. A brief
sketch of feature selection and dimensionality reduction follows this list.
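The following sketch illustrates feature selection and dimensionality reduction with scikit-learn. It is only an example under assumed tooling: the iris dataset, SelectKBest, and PCA are stand-ins chosen for brevity, not methods prescribed by the text above.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)           # 150 samples, 4 features

# Feature selection: keep the 2 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Dimensionality reduction: project the 4 features onto 2 principal components.
X_reduced = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_reduced.shape)    # (150, 2) (150, 2)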

Discretization

Data discretization refers to a method of converting a huge number of data
values into smaller ones so that the evaluation and management of data
become easy. In other words, data discretization is a method of converting
attribute values of continuous data into a finite set of intervals with minimum
data loss. There are two forms of data discretization: the first is supervised
discretization, and the second is unsupervised discretization. Supervised
discretization refers to a method in which the class data is used. Unsupervised
discretization refers to a method that depends on the way the operation
proceeds; it works with a top-down splitting strategy or a bottom-up
merging strategy.

Now, we can understand this concept with the help of an example.
Suppose we have an attribute Age with the given values:

Age (before discretization): 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization, the values are grouped into labelled intervals:

• Child: 1, 5, 4, 9, 7
• Young: 11, 14, 17, 13, 18, 19
• Mature: 31, 33, 36, 42, 44, 46
• Old: 70, 74, 77, 78
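One way to reproduce this Age example in code is with pandas.cut, as in the hedged sketch below; the bin edges (0, 10, 30, 60, 120) are assumptions chosen only so the labels match the grouping above.

import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Map each age into one of four labelled intervals.
labels = pd.cut(ages,
                bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(labels.value_counts())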

Another example is web analytics, where we gather statistics about website
visitors. For example, all visitors who visit the site with an IP address from
India are grouped under the country level "India".

Some Famous techniques of data discretization

Histogram analysis

A histogram is a plot used to represent the underlying frequency
distribution of a continuous data set. It assists in inspecting the data
distribution, for example to spot outliers, skewness, or an approximately
normal shape.
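A short, illustrative way to perform histogram analysis in Python is shown below using matplotlib; the randomly generated values are placeholders, not data from the text.

import matplotlib.pyplot as plt
import numpy as np

# Made-up continuous attribute for illustration.
values = np.random.normal(loc=50, scale=10, size=1_000)

plt.hist(values, bins=20, edgecolor="black")
plt.xlabel("Attribute value")
plt.ylabel("Frequency")
plt.title("Histogram analysis of a continuous attribute")
plt.show()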

Binning

Binning refers to a data smoothing technique that helps to group a huge number
of continuous values into a smaller number of bins. This technique can also be
used for data discretization and the development of a concept hierarchy.
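The sketch below contrasts equal-width binning (pandas.cut) with equal-frequency binning (pandas.qcut) on a small made-up series; both the values and the choice of three bins are assumptions for illustration.

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(prices, bins=3)     # 3 intervals of equal width
equal_freq = pd.qcut(prices, q=3)        # 3 intervals with roughly equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())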

Cluster Analysis

Cluster analysis is a form of data discretization. A clustering algorithm can be
executed by dividing the values of a numeric attribute x into clusters, and each
cluster then represents one discrete value or interval of x.


Data discretization using decision tree analysis

Data discretization using decision tree analysis is a supervised, top-down
splitting technique. In numeric attribute discretization, you first select the
split point that gives the least entropy, and then apply the procedure
recursively. The recursive process divides the attribute into various discretized
disjoint intervals, from top to bottom, using the same splitting criterion.
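As a hedged illustration of this idea, the sketch below trains a scikit-learn decision tree on a made-up labelled Age column and reads off its split thresholds as interval boundaries; the data, the labels, and the max_leaf_nodes setting are all assumptions for the example, not part of the original text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up numeric attribute with class labels (supervised setting).
age = np.array([[5], [9], [14], [18], [25], [33], [42], [51], [63], [75]])
label = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=3)
tree.fit(age, label)

# Thresholds actually used by the tree (-2 marks leaf nodes in sklearn).
edges = sorted(t for t in tree.tree_.threshold if t != -2)
print("Interval boundaries:", edges)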

Data discretization using correlation analysis

When discretizing data by correlation analysis, for example with a linear
regression technique, you can find the best neighboring intervals, and the
large intervals are then combined to develop larger overlaps and form the
final 20 overlapping intervals. It is a supervised procedure.

What is Outlier Analysis

Whenever we talk about data analysis, the term "outliers" often comes to mind.
As the name suggests, outliers refer to the data points that lie outside of
what is expected. The important thing about outliers is what you do with
them. Whenever you analyze a data set, you will have some assumptions about
how the data was generated. If you find data points that are likely to contain
some form of error, then these are definitely outliers, and depending on the
context, you will want to deal with those errors. The data mining process
involves analyzing the data and making predictions from the information it
holds. In 1969, Grubbs introduced the first definition of outliers.

Difference between outliers and noise

Any unwanted error or random variance in a previously measured variable is
called noise. Before finding the outliers present in any data set, it is
recommended first to remove the noise.

Types of Outliers

Outliers are divided into three different types:

• Global or point outliers
• Collective outliers
• Contextual or conditional outliers
Global Outliers

Global outliers are also called point outliers. They are the simplest form of
outliers. When a data point deviates from all the rest of the data points in a
given data set, it is known as a global outlier. In most cases, outlier detection
procedures are targeted at determining global outliers; for example, a single
point lying far away from every other observation is a global outlier.
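A simple way to flag global outliers in code is the interquartile-range (IQR) rule sketched below. The rule and the 1.5 multiplier are common conventions assumed for illustration; they are not prescribed by the text above, and the data is made up.

import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 13])

# IQR rule: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)   # [95]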

Collective Outliers

In a given data set, when a group of data points deviates from the rest of the
data, the group is called a collective outlier. Here, the individual data objects
may not be outliers on their own, but taken together as a whole they behave
as outliers. To identify these types of outliers, you usually need background
information about the relationship between the behaviors of different data
objects. For example, in an intrusion detection system, a denial-of-service (DoS)
request from one system to another may be taken as normal behavior on its
own. However, if this happens across various computers simultaneously, it is
considered abnormal behavior, and as a whole such events are called collective
outliers.

Contextual Outliers

As the name suggests, "Contextual" means this outlier introduced within a


context. For example, in the speech recognition technique, the single
background noise. Contextual outliers are also known as Conditional outliers.
These types of outliers happen if a data object deviates from the other data
points because of any specific condition in a given data set. As we know, there
are two types of attributes of objects of data: contextual attributes and
behavioral attributes. Contextual outlier analysis enables the users to examine
outliers in different contexts and conditions, which can be useful in various
applications. For example, A temperature reading of 45 degrees Celsius may
behave as an outlier in a rainy season. Still, it will behave like a normal data
point in the context of a summer season. In the given diagram, a green dot
representing the low- temperature value in June is a contextual outlier since the
same value in December is not an outlier.
Outliers Analysis

Outliers are often discarded when data mining is applied, but outlier analysis
is still used in many applications like fraud detection, medicine, etc. This is
usually because events that occur rarely can carry much more significant
information than events that occur regularly.

Other applications where outlier detection plays a vital role are given below.

Any unusual response that occurs due to medical treatment can be analyzed
through outlier analysis in data mining.

• Fraud detection in the telecom industry.
• In market analysis, outlier analysis enables marketers to identify unusual
customer behavior.
• In the medical analysis field.
• Fraud detection in banking and finance, such as credit cards, the insurance sector, etc.

The process in which the behavior of outliers in a dataset is identified is
called outlier analysis. It is also known as "outlier mining" and is regarded
as a significant task of data mining.

Train and Test datasets

Machine Learning is one of the booming technologies across the world that
enables computers/machines to turn a huge amount of data into predictions.
However, these predictions highly depend on the quality of the data, and if we
are not using the right data for our model, then it will not generate the expected
result. In machine learning projects, we generally divide the original dataset
into training data and test data. We train our model over a subset of the original
dataset, i.e., the training dataset, and then evaluate whether it can generalize
well to the new or unseen dataset or test set. Therefore, train and test datasets
are the two key concepts of machine learning, where the training dataset is
used to fit the model, and the test dataset is used to evaluate the model.

In this topic, we are going to discuss train and test datasets along with the
difference between both of them. So, let's start with the introduction of the
training dataset and test dataset in Machine Learning.

What is Training Dataset?

The training data is the biggest (in size) subset of the original dataset, which
is used to train or fit the machine learning model. Firstly, the training data is
fed to the ML algorithms, which lets them learn how to make predictions for
the given task.
For example, for training a sentiment analysis model, the training data could be as below:

Input                        Output (Labels)
The new UI is great          Positive
Update is really slow        Negative

The training data varies depending on whether we are using Supervised
Learning or Unsupervised Learning algorithms.

What is Test Dataset?


Once we train the model with the training dataset, it's time to test the model
with the test dataset. This dataset evaluates the performance of the model and
ensures that the model can generalize well with the new or unseen dataset. The
test dataset is another subset of original data, which is independent of the
training dataset.

Need of Splitting dataset into Train and Test set

Splitting the dataset into train and test sets is one of the important parts of data
pre-processing, as by doing so, we can improve the performance of our model
and hence obtain better predictability.
If we trained our model on one dataset and then tested it on a completely
different, unrelated dataset, the model would not be able to capture the
correlations between the features; the test set should therefore be a held-out
portion of the same original dataset.

How do training and testing data work in Machine Learning?

Machine Learning algorithms enable machines to make predictions and
solve problems on the basis of past observations or experiences. An algorithm
takes these experiences or observations from the training data that is fed to it.
Further, one of the great things about ML algorithms is that they can learn and
improve over time on their own, as they are trained with relevant training data.

Once the model is trained enough with the relevant training data, it is tested
with the test data. We can understand the whole process of training and testing
in three steps, which are as follows:

• Feed: Firstly, we need to train the model by feeding it with training input data.
• Define: Now, training data is tagged with the corresponding outputs (in
Supervised Learning), and the model transforms the training data into text vectors or
a number of data features.
• Test: In the last step, we test the model by feeding it with the test data/unseen
dataset. This step ensures that the model is trained efficiently and can generalize well.
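The hedged sketch below walks through this feed/define/test flow with scikit-learn: a simple classifier is fitted on the training split and then evaluated on the held-out test split. The dataset, the LogisticRegression model, and its parameters are illustrative assumptions, not choices made by the text.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feed + define: fit the model on the labelled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Test: predict on the unseen test data and measure generalization.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))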

