
Why is machine learning important?

Machine learning is a form of artificial intelligence (AI) that teaches computers to think in a similar
way to humans: learning and improving upon past experiences. Almost any task that can be completed
with a data-defined pattern or set of rules can be automated with machine learning.

So, why is machine learning important? It allows companies to transform processes that were
previously only possible for humans to perform—think responding to customer service calls,
bookkeeping, and reviewing resumes for everyday businesses. Machine learning can also scale to handle
larger problems and technical questions—think image detection for self-driving cars, predicting natural
disaster locations and timelines, and understanding the potential interaction of drugs with medical

conditions before clinical trials. That’s why machine learning is important.

Why is data important for machine learning?

Machine learning data analysis uses algorithms that continuously improve over time, but quality data is necessary for these models to operate efficiently.

What is a dataset in machine learning?

A single row of data is called an instance. Datasets are a collection of instances that all share a common
attribute. Machine learning models will generally contain a few different datasets, each used to fulfill
various roles in the system.

For machine learning models to understand how to perform various actions, training datasets must first
be fed into the machine learning algorithm, followed by validation datasets (or testing datasets) to ensure
that the model is interpreting this data accurately.

Once you feed these training and validation sets into the system, subsequent datasets can then be used to
sculpt your machine learning model going forward. The more data you provide to the ML system, the faster that model can learn and improve.

What type of data does machine learning need?
Data can come in many forms, but machine learning models rely on four primary data types. These
include numerical data, categorical data, time series data, and text data.
Numerical data
Numerical data, or quantitative data, is any form of measurable data such as your height, weight, or the
cost of your phone bill. You can determine if a set of data is numerical by attempting to average out the
numbers or sort them in ascending or descending order. Exact or whole numbers (e.g., 26 students in a class) are considered discrete numbers, while those that can fall anywhere within a given range (e.g., a 3.6 percent interest rate) are considered continuous numbers. While working with this type of data, keep in mind that numerical data is not tied to any specific point in time; it is simply raw numbers.

Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social class, ethnicity, hometown, the industry you work in, or a variety of other labels. While working with this data type, keep in mind that it is non-numerical, meaning you cannot add the values together, average them, or sort them in any natural order. Categorical data is great for grouping individuals or ideas that share similar attributes, helping your machine learning model streamline its data analysis.

Time series data


Time series data consists of data points that are indexed at specific points in time. More often than not,
this data is collected at consistent intervals. Learning and utilizing time series data makes it easy to
compare data from week to week, month to month, year to year, or according to any other time-based
metric you desire. The distinct difference between time series data and numerical data is that time series
data has established starting and ending points, while numerical data is simply a collection of numbers
that aren’t rooted in particular time periods.

Text data
Text data is simply words, sentences, or paragraphs that can provide some level of insight to your
machine learning models. Since these words can be difficult for models to interpret on their own, they
are most often grouped together or analyzed using various methods such as word frequency, text
classification, or sentiment analysis.
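
To make these four data types concrete, here is a minimal sketch in pandas; the column names and values are illustrative and not taken from the article:

import pandas as pd

# Illustrative example: one column for each data type discussed above.
df = pd.DataFrame({
    "monthly_bill": [42.50, 61.20, 38.99],       # numerical (continuous)
    "num_lines": [1, 2, 1],                      # numerical (discrete)
    "industry": ["retail", "health", "retail"],  # categorical
    "billing_date": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01"]),  # time series
    "support_note": ["late payment", "upgraded plan",
                     "happy customer"],          # text
})

print(df.dtypes)  # pandas infers a different dtype for each column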

Where do engineers get datasets for machine learning?

There is an abundance of places you can find machine learning data, from public dataset repositories to open government data portals and academic archives.
Different Forms of Data

Numeric Data: If a feature represents a characteristic measured in numbers, it is called a numeric feature.

Categorical Data: A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.

Ordinal Data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from “not at all happy” to “very happy”.
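
As a brief illustration of the difference between nominal and ordinal features, here is a small sketch using an ordered pandas Categorical; the size values are illustrative:

import pandas as pd

# An ordinal feature: the categories have a meaningful order.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],  # declared order
    ordered=True,
)

s = pd.Series(sizes)
print(s.min(), s.max())      # order-aware comparisons: small large
print(s.cat.codes.tolist())  # encoded in order as 0, 2, 1, 0

# A nominal feature such as hometown has no such ordering,
# so comparisons like min()/max() would not be meaningful.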

Properties of Data –

Volume: The scale of the data. With a growing world population and ever more widespread technology, huge amounts of data are generated every millisecond.

Variety: The different forms of data – healthcare records, images, videos, audio clips.

Velocity: The rate at which data is streamed and generated.

Value: The meaningfulness of the data in terms of the information researchers can infer from it.

Veracity: The certainty and correctness of the data we are working on.

Advantages of using data in Machine Learning:

Improved accuracy

Automation

Personalization

Cost savings

Disadvantages of using data in Machine Learning:

Bias

Privacy

Quality of data

Lack of interpretability

Data Quality Definition

Data quality is the measure of how well suited a data set is to serve its specific purpose. Measures of
data quality are based on data quality characteristics such as accuracy, completeness, consistency,
validity, uniqueness, and timeliness.

What is Data Quality?

Data quality refers to the development and implementation of activities that apply quality management techniques to data in order to ensure the data is fit to serve the specific needs of an organization in a particular context. Data that is deemed fit for its intended purpose is considered high-quality data. Examples of data quality issues include duplicated data, incomplete data, inconsistent data, incorrect data, poorly defined data, poorly organized data, and poor data security.

Data quality assessments are executed by data quality analysts, who assess and interpret each individual data quality metric, aggregate a score for the overall quality of the data, and provide organizations with a percentage to represent the accuracy of their data. A low data quality scorecard indicates poor data quality, which is of low value, is misleading, and can lead to poor decision making that may harm the organization.

Data quality rules are an integral component of data governance, which is the process of developing and establishing a defined, agreed-upon set of rules and standards by which all data across an organization is governed. Effective data governance should harmonize data from various data sources, create and monitor data usage policies, and eliminate inconsistencies and inaccuracies that would otherwise negatively impact data analytics accuracy and regulatory compliance.

Data Quality Dimensions

There are six main dimensions of data quality: accuracy, completeness, consistency, validity, uniqueness, and timeliness.

Accuracy: The data should reflect actual, real-world scenarios; the measure of accuracy can be
confirmed with a verifiable source.

Completeness: Completeness is a measure of the data’s ability to effectively deliver all the required
values that are available.

Consistency: Data consistency refers to the uniformity of data as it moves across networks and applications. The same data values stored in different locations should not conflict with one another.

Validity: Data should be collected according to defined business rules and parameters, and should conform to the right format and fall within the right range.

Uniqueness: Uniqueness ensures there are no duplications or overlapping of values across all data sets.
Data cleansing and deduplication can help remedy a low uniqueness score.

Timeliness: Timely data is data that is available when it is required. Data may be updated in real time to
ensure that it is readily available and accessible.
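
As a rough sketch of how a few of these dimensions can be checked programmatically, the following uses a hypothetical pandas DataFrame with columns customer_id, email, and age (none of which come from the article):

import pandas as pd

# Hypothetical example data; in practice this would come from your source system.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],
    "age": [34, 29, 29, 150],
})

# Completeness: fraction of missing values per column.
print(df.isna().mean())

# Uniqueness: duplicated identifiers point to overlapping records.
print(df["customer_id"].duplicated().sum(), "duplicate id(s)")

# Validity: values should fall within a sensible range.
invalid_age = df[~df["age"].between(0, 120)]
print(len(invalid_age), "row(s) with out-of-range age")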

How to Improve Data Quality

Data quality improvements can be accomplished with data quality tools, which typically provide data quality management capabilities such as:

Data profiling – The first step in the data quality improvement process is understanding your data. Data
profiling is the initial assessment of the current state of the data sets.

Data Standardization – Disparate data sets are conformed to a common data format.

Geocoding – The description of a location is transformed into coordinates that conform to U.S. and worldwide geographic standards.

Matching or Linking – Data matching identifies and merges matching pieces of information in big data
sets.

Data Quality Monitoring – Frequent data quality checks are essential. Data quality software in
combination with machine learning can automatically detect, report, and correct data variations based
on predefined business rules and parameters.

Batch and Real time – Once the data is initially cleansed, an effective data quality framework should be
able to deploy the same rules and processes across all applications and data types at scale.
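
Data profiling, the first step above, is often done with a few quick summary calls. A minimal sketch in pandas (the file name customers.csv is an assumed placeholder):

import pandas as pd

# Hypothetical dataset to profile (assumed file name, for illustration).
df = pd.read_csv("customers.csv")

# Initial profiling: understand shape, types, and summary statistics.
print(df.shape)         # number of rows and columns
print(df.dtypes)        # inferred type of each column
print(df.describe())    # count, mean, std, min/max for numeric columns
print(df.isna().sum())  # missing values per column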

Data Preprocessing in Machine learning

Data preprocessing is a process of preparing the raw data and making it suitable for a machine learning
model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted data. Before doing any operation with data, it is essential to clean it and put it into a formatted state, and for this we use data preprocessing.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used for machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

It involves the steps below:


Getting the dataset

Importing libraries

Importing datasets

Finding Missing Data

Encoding Categorical Data

Splitting dataset into training and test set

Feature scaling

1) Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem in a proper format is known as the dataset.

Datasets come in different formats for different purposes: a dataset for a business problem will differ from the dataset required for a medical problem such as liver disease, so each dataset is different from the others. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.

2) Importing Libraries

These libraries are used to perform specific jobs. There are three libraries that we will use for data preprocessing:

NumPy: The NumPy library is used to include any type of mathematical operation in the code. It is the fundamental package for scientific computing in Python.

Matplotlib: The second library is Matplotlib, a Python 2D plotting library; with this library, we need to import the sub-library pyplot.

Pandas: The last library is Pandas, one of the most famous Python libraries, used for importing and managing datasets.
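
A minimal sketch of the conventional imports (the aliases np, plt, and pd are community conventions, not requirements):

import numpy as np               # mathematical operations on arrays
import matplotlib.pyplot as plt  # 2D plotting via the pyplot sub-library
import pandas as pd              # dataset import and management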

3) Importing the Datasets

Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as the working directory. To set a working directory in the Spyder IDE, we follow the steps below:

1- Save your Python file in the directory which contains the dataset.

2- Go to the File Explorer option in the Spyder IDE, and select the required directory.

3- Press the F5 button or click the Run option to execute the file.
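
Once the working directory is set, a minimal sketch of importing a dataset with pandas (the file name data.csv is an assumed placeholder):

import pandas as pd

# Read the dataset from the working directory (assumed file name).
dataset = pd.read_csv("data.csv")

# By convention, features go into X and the target label into y.
X = dataset.iloc[:, :-1].values  # all columns except the last
y = dataset.iloc[:, -1].values   # the last column as the target
print(X.shape, y.shape)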

4) Handling Missing data:


The next step of data preprocessing is to handle missing data in the datasets. If our dataset contains
some missing data, then it may create a huge problem for our machine learning model. Hence it is
necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data:

By deleting the particular row: Rows (or columns) that contain null values are removed. This is simple, but it risks losing information that may be important.

By calculating the mean: The missing value is replaced with the mean (or median or mode) of the column that contains it. This is useful for columns with numeric data.
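
A minimal sketch of the mean strategy using scikit-learn's SimpleImputer, assuming a numeric feature matrix X with NaN entries (the values are illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with a missing value.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # NaN in column 0 becomes (1.0 + 7.0) / 2 = 4.0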

5) Encoding Categorical data:

Categorical data is data which has some categories; for example, a dataset may contain two categorical variables such as Country and Purchased. Since a machine learning model works entirely on mathematics and numbers, a categorical variable in the dataset may create trouble while building the model, so it is necessary to encode these categorical variables into numbers.
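
A minimal sketch of one common approach, one-hot encoding with pandas (the Country and Purchased values are illustrative):

import pandas as pd

# Hypothetical categorical columns.
df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"],
                   "Purchased": ["No", "Yes", "No", "Yes"]})

# One-hot encode Country: one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Country"])

# Binary labels like Purchased can be mapped directly to 0/1.
encoded["Purchased"] = encoded["Purchased"].map({"No": 0, "Yes": 1})
print(encoded)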

6) Splitting the Dataset into the Training set and Test set

In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing, as by doing this we can enhance the performance of our machine learning model. Suppose we train our machine learning model on one dataset and then test it on a completely different dataset; the model will struggle to make sense of data that behaves differently from what it has learned. Likewise, even if we train our model very well and its training accuracy is very high, its performance may still drop when we give it a new dataset. So we always try to make a machine learning model that performs well with the training set and also with the test dataset. Here, we can define these datasets as:

Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.

Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
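
A minimal sketch using scikit-learn's train_test_split (the 80/20 split ratio is a common convention, not mandated by the article):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the data for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)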

7) Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range. In feature scaling, we put our variables on the same scale so that no one variable dominates the others.
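
A minimal sketch using scikit-learn's StandardScaler; MinMaxScaler would be the analogous choice for normalizing into a fixed range (the age and salary values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g., age vs. salary).
X = np.array([[25, 40000.0],
              [35, 60000.0],
              [45, 80000.0]])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]

In practice, the scaler should be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into preprocessing.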
