
Important Notes on Data Science

 Define any 2 libraries in Python? Explain their importance


 What is a DataFrame? Explain the importance of Data Frame
 What is descriptive Statistics? Explain its importance
 What is meant by Data Cleaning? Explain any one process
 What is meant by Slicing? Explain with examples
 What is meant by Data Transformation? Explain with example
 What is meant by Outlier? Explain with example
 Explain with example, the term data science?
 What is meant by Descriptive Statistics? Explain with example
 What is meant by python libraries? Explain with example
 What is meant by Data Cleaning? Explain with example
 What is meant by ndarray? Explain with example
 What is meant by the term Data Frame? Explain with example
 Define Python Numpy? Explain its importance
 Explain any 1 library in Python? Explain its importance
 What is a Dataset? Explain the importance of Data set
 What is inferential Statistics? Explain its importance
 What is meant by Data cleaning? Explain any one process
 What is meant by indexing? Explain with examples
 What is meant by Data Reduction? Explain with example
 What is meant by Outlier? Explain with example
 Explain with example, the term data science?
 What is meant by missing data? Explain with example
 What is meant by python modules? Explain with example
 What is meant by Data transformation? Explain with example
 What is meant by array? Explain with example
 What is meant by the term Data Frame? Explain with example
 Define Pycharm? Explain its importance
 Explain any 2 libraries in Python? Explain their importance
 What is a Database? How is it different from DataFrame
 What is Data Reduction? Explain its importance
 What is meant by missing Data? Explain with examples
 What is meant by indexing? Explain with examples
 What is meant by Data Reduction? Explain with example
 What is meant by Outlier? Explain with example
 Explain with example, the term data science?
 What is meant by missing data? Explain with example
 What is meant by python modules? Explain with example
 What is meant by Data transformation? Explain with example
 What is meant by array? Explain with example
 What is meant by the term Data Frame? Explain with example

A CSV file contains the following fields


 Sr. No, Name, Date of Joining, City of Joining, Client, Salary
 You are required to address the following by writing appropriate code in python
 Retrieve the Data by using appropriate python libraries and modules
 Generate the report on missing values
 Clean the data for inaccurate values of date of joining
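 A possible solution sketch follows; the filename joining.csv and the exact column labels are
assumptions, since the question does not name them:

# Solution sketch; filename and column labels are assumed
import pandas as pd

# Retrieve the data using the pandas library
df = pd.read_csv("joining.csv")

# Generate the report on missing values (NaN count per column)
print(df.isnull().sum())

# Clean inaccurate values of Date of Joining: values that cannot be
# parsed as dates become NaT and can then be inspected or dropped
df["Date of Joining"] = pd.to_datetime(df["Date of Joining"], errors="coerce")
print(df[df["Date of Joining"].isna()])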

A CSV file contains the following fields


 Sr. No, Product Number, Manufacturing Date, Unit Price, Quantity ordered, Total
Price Paid
 You are required to address the following by writing appropriate code in python
 Retrieve the Data by using appropriate python libraries and modules
 Generate the report on inaccurate values of Quantity ordered, Total price paid
 Clean the data for missing manufacturing date by inserting XXXX in the missing
values of Manufacturing date
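 A possible solution sketch follows; the filename orders.csv and the exact column labels are
assumptions, and "inaccurate" is read here as non-numeric, negative, or inconsistent with
Unit Price * Quantity:

# Solution sketch; filename and column labels are assumed
import pandas as pd

# Retrieve the data
df = pd.read_csv("orders.csv")

# Report on inaccurate values of Quantity ordered and Total Price Paid
qty = pd.to_numeric(df["Quantity ordered"], errors="coerce")
price = pd.to_numeric(df["Unit Price"], errors="coerce")
total = pd.to_numeric(df["Total Price Paid"], errors="coerce")
inaccurate = df[qty.isna() | total.isna() | (qty < 0) | (total < 0) |
                (total != price * qty)]
print(inaccurate)

# Clean missing Manufacturing Date by inserting XXXX
df["Manufacturing Date"] = df["Manufacturing Date"].fillna("XXXX")
print(df)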

Python Programming Language

What is Python?

Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for: web development (server-side), software development, mathematics, system
scripting.
What can Python do? Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software development.

Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines than some other
programming languages.
Python runs on an interpreter system, meaning that code can be executed as soon as it is
written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a functional way.
Good to know
The most recent major version of Python is Python 3, which we shall be using in this tutorial.
However, Python 2, although not being updated with anything other than security updates, is
still quite popular.
In this tutorial Python will be written in a text editor. It is also possible to write Python in an
Integrated Development Environment, such as Thonny, PyCharm, NetBeans or Eclipse, which
are particularly useful when managing larger collections of Python files.
Python Syntax compared to other programming languages
Python was designed for readability, and has some similarities to the English language with
influence from mathematics.
Python uses new lines to complete a command, as opposed to other programming languages
which often use semicolons or parentheses.
Python relies on indentation, using whitespace, to define scope; such as the scope of loops,
functions and classes. Other programming languages often use curly-brackets for this
purpose.
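
For example, the statements that belong to a loop are identified purely by their indentation:

# the indented line is inside the loop; the unindented line is not
for i in range(3):
    print("inside the loop:", i)
print("outside the loop")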

Python Libraries

Normally, a library is a collection of books or is a room or place where many books are
stored to be used later. Similarly, in the programming world, a library is a collection of
precompiled codes that can be used later on in a program for some specific well-defined
operations. Other than pre-compiled codes, a library may contain documentation,
configuration data, message templates, classes, and values, etc.

A Python library is a collection of related modules. It contains bundles of code that can be
reused across different programs, which makes Python programming simpler and more
convenient for the programmer, since we don't need to write the same code again and again
for different programs. Python libraries play a very vital role in fields such as Machine
Learning, Data Science, and Data Visualization.

Working of Python Library

As stated above, a Python library is simply a collection of code, or modules of code, that
we can use in a program for specific operations. We use libraries so that we don't need to
rewrite code in our program that is already available. But how does it work? In the MS
Windows environment, for example, library files have a DLL extension (Dynamic Link
Library). When we link a library with our program and run that program, the linker
automatically searches for that library, extracts its functionality, and interprets the program
accordingly. That's how we use the methods of a library in our program. We will see further
how we bring libraries into our Python programs.

Python standard library


The Python Standard Library contains the exact syntax, semantics, and tokens of Python. It
contains built-in modules that provide access to basic system functionality like I/O and some
other core modules. Much of the standard library is implemented in the C programming
language. The Python standard library consists of more than 200 core modules, and all of
these work together to make Python a high-level programming language. The standard
library plays a very important role: without it, programmers could not access Python's core
functionality. Beyond it, there are several other libraries in Python that make a programmer's
life easier. Let's have a look at some of the commonly used ones:

 TensorFlow: This library was developed by Google in collaboration with the Brain
Team. It is an open-source library used for high-level computations. It is also used in
machine learning and deep learning algorithms. It contains a large number of tensor
operations. Researchers also use this Python library to solve complex computations in
Mathematics and Physics.
 Matplotlib: This library is responsible for plotting numerical data, which is why it is
widely used in data analysis. It is also an open-source library and plots high-quality
figures such as pie charts, histograms, scatterplots, graphs, etc.
 Pandas: Pandas is an important library for data scientists. It is an open-source
machine learning library that provides flexible high-level data structures and a variety
of analysis tools. It eases data analysis, data manipulation, and cleaning of data.
Pandas supports operations like sorting, re-indexing, iteration, concatenation,
conversion of data, visualizations, aggregations, etc.
 Numpy: The name “Numpy” stands for “Numerical Python”. It is one of the most
commonly used libraries. It is a popular machine learning library that supports large
matrices and multi-dimensional data, and it provides built-in mathematical functions
for easy computations. Even libraries like TensorFlow use Numpy internally to
perform several operations on tensors. The array interface is one of the key features of
this library.
 SciPy: The name “SciPy” stands for “Scientific Python”. It is an open-source library
used for high-level scientific computations. It is built as an extension of Numpy and
works with Numpy to handle complex computations: while Numpy provides sorting
and indexing of array data, the higher-level numerical routines live in SciPy. It is also
widely used by application developers and engineers.
 Scrapy: It is an open-source library that is used for extracting data from websites. It
provides very fast web crawling and high-level screen scraping. It can also be used for
data mining and automated testing of data.
 Scikit-learn: It is a famous Python library for working with complex data. Scikit-learn
is an open-source library that supports machine learning. It supports various
supervised and unsupervised algorithms such as linear regression, classification,
clustering, etc. This library works in association with Numpy and SciPy.
 PyGame: This library provides an easy interface to the Simple DirectMedia Layer
(SDL) platform-independent graphics, audio, and input libraries. It is used for
developing video games using computer graphics and audio libraries along with the
Python programming language.
 PyTorch: PyTorch is one of the largest machine learning libraries and optimizes
tensor computations. It has rich APIs for performing tensor computations with strong
GPU acceleration, and it also helps to solve application issues related to neural
networks.
 PyBrain: The name “PyBrain” stands for Python-Based Reinforcement Learning,
Artificial Intelligence, and Neural Networks library. It is an open-source library built
for beginners in the field of Machine Learning. It provides fast and easy-to-use
algorithms for machine learning tasks, and it is flexible and easy to understand, which
makes it really helpful for developers who are new to research fields.
There are many more libraries in Python. We can use a suitable library for our purposes.
Hence, Python libraries play a very crucial role and are very helpful to the developers.

Use of Libraries in Python Program


As we write large programs in Python, we want to maintain the code's modularity. For easy
maintenance of the code, we split it into different parts so that we can reuse each part
whenever we need it. In Python, modules play that part. Instead of repeating the same code
in different programs and making the code complex, we define commonly used functions in
modules and simply import them into a program wherever they are required. We don't need
to rewrite that code, but we can still use its functionality by importing its module. Multiple
interrelated modules are stored in a library, and whenever we need to use a module, we
import it from its library. In Python this is a very simple job thanks to its easy syntax: we
just need to use import.
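
For example, using the sqrt function from the standard library's math module:

# import the module, then access its functions with the dot operator
import math
print(math.sqrt(16))   # 4.0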

Python Lists
Python lists are just like dynamically sized arrays declared in other languages (vector in C++
and ArrayList in Java). In simple language, a list is a collection of things, enclosed in [ ]
and separated by commas.

The list is a sequence data type which is used to store the collection of data. Tuples and
String are other types of sequence data types.

Lists are the simplest containers that are an integral part of the Python language. Lists need
not be homogeneous, which makes them one of the most powerful tools in Python. A single
list may contain data types like integers and strings, as well as objects. Lists are mutable, and
hence they can be altered even after their creation.

Creating a List in Python


Lists in Python can be created by just placing the sequence inside square brackets []. Unlike
sets, a list does not need a built-in function for its creation.

Complexities for Creating Lists


Time Complexity: O(1)

Space Complexity: O(n)

Creating a list with multiple distinct or duplicate elements


A list may contain duplicate values with their distinct positions and hence, multiple distinct
or duplicate values can be passed as a sequence at the time of list creation.

# Creating a List with
# the use of Numbers
# (Having duplicate values)
List = [1, 2, 4, 4, 3, 3, 3, 6, 5]
print("\nList with the use of Numbers: ")
print(List)

# Creating a List with
# mixed type of values
# (Having numbers and strings)
List = [1, 2, 'Geeks', 4, 'For', 6, 'Geeks']
print("\nList with the use of Mixed Values: ")
print(List)

Accessing elements from the List


In order to access the list items refer to the index number. Use the index operator [ ] to access
an item in a list. The index must be an integer. Nested lists are accessed using nested
indexing.

Accessing elements from list

# Python program to demonstrate
# accessing of element from list

# Creating a List with
# the use of multiple values
List = ["Geeks", "For", "Geeks"]

# accessing an element from the
# list using index number
print("Accessing an element from the list")
print(List[0])
print(List[2])

Pandas DataFrames

What is a DataFrame?

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table
with rows and columns.

Create a simple Pandas DataFrame:

import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}

# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)

Locate Row
The DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified row(s)

Example
Return row 0:

# refer to the row index:
print(df.loc[0])
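
The loc attribute can also return several rows at once; passing a list of indexes returns a
DataFrame (a short sketch continuing the df created above):

# return rows 0 and 1 as a DataFrame
print(df.loc[[0, 1]])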

Slice() Function in Python

The built-in slice() function creates a slice object that extracts a section of a sequence and
returns it as new data, without modifying the original. This means users can take a specific
range of elements without changing the source.
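
A short sketch of the built-in slice() function, which builds a reusable slice object that can be
used in place of the colon notation:

Lst = [50, 70, 30, 20, 90, 10, 50]

# slice(start, stop) builds a slice object covering indexes 1 to 4
s = slice(1, 5)
print(Lst[s])    # [70, 30, 20, 90]
print(Lst)       # the original list is unchanged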

Python List Slicing


In Python, list slicing is a common practice and one of the techniques programmers use most
often to work with lists efficiently. To access a range of elements in a list, you need to slice
the list. One way to do this is to use the simple slicing operator, i.e. the colon (:). With this
operator, one can specify where to start the slicing, where to end, and the step. List slicing
returns a new list from the existing list.

Python List Slicing Syntax


The format for Python list slicing is as follows:

Lst[ Initial : End : IndexJump ]


If Lst is a list, then the above expression returns the portion of the list from index Initial to
index End, at a step size IndexJump.

Indexing in Python List


Indexing is a technique for accessing the elements of a Python List. There are various ways
by which we can access an element of a list.

Positive Indexes
In the case of Positive Indexing, the first element of the list has the index number 0, and the
last element of the list has the index number N-1, where N is the total number of elements in
the list (size of the list).

Positive Indexing of a Python List

Example:

In this example, we will display a whole list using positive index slicing.

# Initialize list
Lst = [50, 70, 30, 20, 90, 10, 50]

# Display list
print(Lst[::])
Output:

[50, 70, 30, 20, 90, 10, 50]

Negative Indexes
The below diagram illustrates a list along with its negative indexes. Index -1 represents the
last element and -N represents the first element of the list, where N is the length of the list.

Negative Indexing of a Python List

Example:

In this example, we will access the elements of a list using negative indexing.

# Initialize list
Lst = [50, 70, 30, 20, 90, 10, 50]

# Display list
print(Lst[-7::1])
Output:

[50, 70, 30, 20, 90, 10, 50]


Slicing
As mentioned earlier list slicing in Python is a common practice and can be used both with
positive indexes as well as negative indexes. The below diagram illustrates the technique of
list slicing:

Python List Slicing

Example:

In this example, we will transform the above illustration into Python code.

# Initialize list
Lst = [50, 70, 30, 20, 90, 10, 50]

# Display list
print(Lst[1:5])
Output:

[70, 30, 20, 90]


Examples of List Slicing in Python
Let us see some examples which depict the use of list slicing in Python.

Example 1: Leaving any argument like Initial, End, or IndexJump blank will lead to the use
of default values i.e. 0 as Initial, length of the list as End, and 1 as IndexJump.

# Initialize list
List = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Show original list
print("Original List:\n", List)

print("\nSliced Lists: ")

# Display sliced list
print(List[3:9:2])

# Display sliced list
print(List[::2])

# Display sliced list
print(List[::])
Output:

Original List:
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Sliced Lists:
[4, 6, 8]
[1, 3, 5, 7, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
Example 2: A reversed list can be generated by using a negative integer as the IndexJump
argument, leaving the Initial and End blank. We need to choose the Initial and End values
according to the reversed order if the IndexJump value is negative.

# Initialize list
List = ['Geeks', 4, 'geeks !']

# Show original list
print("Original List:\n", List)

print("\nSliced Lists: ")

# Display sliced list
print(List[::-1])

# Display sliced list
print(List[::-3])

# Display sliced list
print(List[:1:-2])
Output:

Original List:
['Geeks', 4, 'geeks !']

Sliced Lists:
['geeks !', 4, 'Geeks']
['geeks !']
['geeks !']
Example 3: If some slicing expressions are made that do not make sense or are incomputable
then empty lists are generated.

# Initialize list
List = [-999, 'G4G', 1706256, '^_^', 3.1496]

# Show original list
print("Original List:\n", List)
print("\nSliced Lists: ")

# Display sliced list
print(List[10::2])

# Display sliced list
print(List[1:1:1])

# Display sliced list
print(List[-1:-1:-1])

# Display sliced list
print(List[:0:])
Output:

Original List:
[-999, 'G4G', 1706256, '^_^', 3.1496]

Sliced Lists:
[]
[]
[]
[]
Example 4: List slicing can be used to modify lists or even delete elements from a list.

# Initialize list
List = [-999, 'G4G', 1706256, 3.1496, '^_^']

# Show original list
print("Original List:\n", List)

print("\nSliced Lists: ")

# Modified List
List[2:4] = ['Geeks', 'for', 'Geeks', '!']

# Display sliced list
print(List)

# Modified List
List[:6] = []

# Display sliced list
print(List)
Output:

Original List:
[-999, 'G4G', 1706256, 3.1496, '^_^']
Sliced Lists:
[-999, 'G4G', 'Geeks', 'for', 'Geeks', '!', '^_^']
['^_^']
Example 5: By concatenating sliced lists, a new list can be created or even a pre-existing list
can be modified.

# Initialize list
List = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Show original list
print("Original List:\n", List)

print("\nSliced Lists: ")

# Creating new List
newList = List[:3] + List[7:]

# Display sliced list
print(newList)

# Changing existing List
List = List[::2] + List[1::2]

# Display sliced list
print(List)
Output:

Original List:
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Sliced Lists:
[1, 2, 3, 8, 9]
[1, 3, 5, 7, 9, 2, 4, 6, 8]

What is data reduction in Python?


Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount
of irrelevant or redundant information.

Data Reduction in Data Mining


The method of data reduction may achieve a condensed description of the original data which
is much smaller in quantity but keeps the quality of the original data.


There are several different data reduction techniques that can be used in data mining,
including:

 Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
 Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
 Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
 Data Discretization: This technique involves converting continuous data into discrete data
by partitioning the range of possible values into intervals or bins.
 Feature Selection: This technique involves selecting a subset of features from the dataset
that are most relevant to the task at hand.
It's important to note that data reduction involves a trade-off between accuracy and the size
of the data: the more the data is reduced, the less accurate and the less generalizable the
resulting model may become.
In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it is important to be aware of the trade-off between the size and accuracy
of the data, and carefully assess the risks and benefits before implementing it.
Methods of data reduction:
These are explained as following below.

1. Data Cube Aggregation:


This technique is used to aggregate data in a simpler form. For example, imagine that the
information you gathered for your analysis for the years 2012 to 2014 includes your
company's revenue for every three months. If the analysis concerns annual sales rather than
the quarterly figures, we can summarize the data in such a way that the resulting data
reports the total sales per year instead of per quarter.
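
A minimal pandas sketch of this idea, using hypothetical quarterly revenue figures:

import pandas as pd

# Hypothetical quarterly revenue data for 2012-2014
df = pd.DataFrame({
    "year": [2012]*4 + [2013]*4 + [2014]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [10, 12, 11, 14, 13, 15, 14, 16, 15, 17, 16, 18],
})

# Aggregate the quarterly figures into total sales per year
annual = df.groupby("year")["revenue"].sum()
print(annual)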

2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep just the attributes
required for our analysis. Dimension reduction shrinks the data size by eliminating outdated
or redundant features.

Step-wise Forward Selection –


The selection begins with an empty set of attributes; at each step we add the best of the
remaining original attributes to the set, based on their relevance (measured, for example, by
a p-value in statistics).
Suppose there are the following attributes in the data set in which few attributes are
redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


Step-wise Backward Selection –
This selection starts with a set of complete attributes in the original data and at each point, it
eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes are
redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes simultaneously, saving time
and making the process faster.
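
Scikit-learn exposes both directions through its SequentialFeatureSelector; the following is a
sketch on the diabetes dataset used later in these notes (the estimator and the number of
features to keep are illustrative choices):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward selection: start from an empty set and add the best
# remaining attribute at each step (use direction="backward" to
# start from the full set and drop the worst attribute instead)
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())   # Boolean mask of the selected attributes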
3. Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.

Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data size reduction.
Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
Lossy Compression –
Methods such as the Discrete Wavelet Transform technique and PCA (principal component
analysis) are examples of this compression. For example, the JPEG image format uses lossy
compression, but the result is visually equivalent to the original image. In lossy data
compression, the decompressed data may differ from the original data but are useful enough
to retrieve information from them.
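
A small sketch of lossy reduction with PCA from scikit-learn (the data and the number of
components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 rows with 6 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Project onto 2 principal components; some information is lost
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept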
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a
smaller representation of the data, so only the model parameters need to be stored
(parametric methods); alternatively, non-parametric methods such as clustering, histograms,
and sampling can be used.

5. Discretization & Concept Hierarchy Operation:


Techniques of data discretization are used to divide the attributes of the continuous nature
into data with intervals. We replace many constant values of the attributes by labels of small
intervals. This means that mining results are shown in a concise, and easily understandable
way.

Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide
the whole set of attributes and repeat this method up to the end, then the process is known as
top-down discretization also known as splitting.
Bottom-up discretization –
If you first consider all the constant values as split points, some are discarded through a
combination of the neighborhood values in the interval, that process is called bottom-up
discretization.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for
age) with high-level concepts (categorical variables such as middle age or Senior).

For numeric data following techniques can be followed:

Binning –
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user (see
the sketch after this list).
Histogram analysis –
Like the process of binning, the histogram is used to partition the value for the attribute X,
into disjoint ranges called brackets. There are several partitioning rules:
Equal Frequency partitioning: Partitioning the values based on their number of occurrences in
the data set.
Equal Width Partitioning: Partitioning the values in a fixed gap based on the number of bins
i.e. a set of values ranging from 0-20.
Clustering: Grouping similar data together.
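
A short binning sketch with pandas, as referenced above, turning a numeric age column into
the categorical labels mentioned under Concept Hierarchies (the bin edges are illustrative):

import pandas as pd

ages = pd.Series([12, 25, 43, 58, 67, 81])

# Cut the continuous values into user-specified intervals with labels
bins = pd.cut(ages, bins=[0, 30, 60, 100],
              labels=["young", "middle age", "senior"])
print(bins)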

Advantages and Disadvantages of Data Reduction in Data Mining:


Data reduction in data mining can have a number of advantages and disadvantages.
Advantages:
Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to
work with large datasets.
Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This
can help to make the model more accurate and robust.
Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
Improved interpretability: Data reduction can help to improve the interpretability of the
results by removing irrelevant or redundant information from the dataset.
Disadvantages:
Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size
of the dataset can also remove important information that is needed for accurate predictions.
Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
Additional computational costs: Data reduction can add additional computational costs to the
data mining process, as it requires additional processing time to reduce the data.
In conclusion, data reduction can have both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it can also result in a loss of information, and make it harder to interpret
the results. It’s important to weigh the pros and cons of data reduction and carefully assess
the risks and benefits before implementing it.
Pandas DataFrame.transform
Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations align on both row and
column labels. It can be thought of as a dict-like container for Series objects. This is the
primary data structure of the Pandas.

The Pandas DataFrame.transform() function calls func on self, producing a DataFrame with
transformed values that has the same axis length as self.

Syntax: DataFrame.transform(func, axis=0, *args, **kwargs)

Parameter :
func : Function to use for transforming the data
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
*args : Positional arguments to pass to func.
**kwargs : Keyword arguments to pass to func.

Returns : DataFrame

Example #1 : Use the DataFrame.transform() function to add 10 to each element in the
dataframe.

# importing pandas as pd
import pandas as pd

# Creating the DataFrame
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})

# Create the index
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']

# Set the index
df.index = index_

# Print the DataFrame
print(df)
Output :

Now we will use DataFrame.transform() function to add 10 to each element of the dataframe.

# add 10 to each element of the dataframe
result = df.transform(func=lambda x: x + 10)

# Print the result
print(result)
Output :

As we can see in the output, the DataFrame.transform() function has successfully added 10 to
each element of the given Dataframe.

Example #2 : Use the DataFrame.transform() function to find the square root of each element
of the dataframe and the result of raising Euler's number to each element.

# importing pandas as pd
import pandas as pd

# Creating the DataFrame
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})

# Create the index
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']

# Set the index
df.index = index_

# Print the DataFrame
print(df)
Output :

Now we will use the DataFrame.transform() function to find the square root of each element
and the result of raising Euler's number to each element of the dataframe.

# pass a list of functions
result = df.transform(func=['sqrt', 'exp'])

# Print the result
print(result)
Output :

As we can see in the output, the DataFrame.transform() function has successfully performed
the desired operation on the given dataframe.

Outliers in Python

Detect and Remove the Outliers using Python


Outliers, deviating significantly from the norm, can distort measures of central tendency and
affect statistical analyses. The piece explores common causes of outliers, from errors to
intentional introduction, and highlights their relevance in outlier mining during data analysis.
The article delves into the significance of outliers in data analysis, emphasizing their potential
impact on statistical results.

What is an Outlier?
An Outlier is a data item/object that deviates significantly from the rest of the (so-called
normal) objects. Identifying outliers is important in statistics and data analysis because they
can have a significant impact on the results of statistical analyses. The analysis for outlier
detection is referred to as outlier mining.

Outliers can skew the mean (average) and affect measures of central tendency, as well as
influence the results of tests of statistical significance.

How Are Outliers Caused?


Outliers can be caused by a variety of factors, and they often result from genuine variability
in the data or from errors in data collection, measurement, or recording. Some common
causes of outliers are:

 Measurement errors: Errors in data collection or measurement processes can lead to outliers.
 Sampling errors: In some cases, outliers can arise due to issues with the sampling process.
 Natural variability: Inherent variability in certain phenomena can also lead to outliers. Some
systems may exhibit extreme values due to the nature of the process being studied.
 Data entry errors: Human errors during data entry can introduce outliers.
 Experimental errors: In experimental settings, anomalies may occur due to uncontrolled
factors, equipment malfunctions, or unexpected events.
 Sampling from multiple populations: Data is inadvertently combined from multiple
populations with different characteristics.
 Intentional outliers: Outliers are introduced intentionally to test the robustness of statistical
methods.
Outlier Detection And Removal
Here a pandas data frame is used for a more realistic approach, as real-world projects need to
detect the outliers that arise during the data analysis step; the same approach can be used on
lists and series-type objects.

Dataset Used For Outlier Detection

The dataset used in this article is the Diabetes dataset and it is preloaded in the Sklearn
library.

# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
diabetics = load_diabetes()

# Create the dataframe
column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
print(df_diabetics.head())
Output:

age sex bmi bp s1 s2 s3 \


0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142

s4 s5 s6
0 -0.002592 0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592 0.002861 -0.025930
3 0.034309 0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641
Outliers can be detected using visualization, implementing mathematical formulas on the
dataset, or using the statistical approach. All of these are discussed below.

Visualizing and Removing Outliers Using Box Plot


It captures the summary of the data effectively and efficiently with only a simple box and
whiskers. A boxplot summarizes sample data using the 25th, 50th, and 75th percentiles. One
can get insights (quartiles, median, and outliers) into the dataset just by looking at its boxplot.

# Box Plot
import seaborn as sns
sns.boxplot(df_diabetics['bmi'])
Output:

Outliers present in the bmi columns



In the above graph, one can clearly see that values above 0.12 are acting as outliers.

import seaborn as sns
import matplotlib.pyplot as plt

def removal_box_plot(df, column, threshold):
    sns.boxplot(df[column])
    plt.title(f'Original Box Plot of {column}')
    plt.show()

    removed_outliers = df[df[column] <= threshold]

    sns.boxplot(removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers

threshold_value = 0.12

no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)


Output:

Box Plot

Visualizing and Removing Outliers Using Scatterplot


It is used when you have paired numerical data, when your dependent variable has
multiple values for each reading of the independent variable, or when trying to determine the
relationship between the two variables. In the process of utilizing the scatter plot, one can
also use it for outlier detection.

To plot the scatter plot one requires two variables that are somehow related to each other. So
here, body mass index and blood pressure are used, whose column names are “bmi” and “bp”
respectively.

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('(body mass index of people)')
ax.set_ylabel('(bp of the people)')
plt.show()
Output:

Scatter plot of bp and bmi

Looking at the graph, one can see that most of the data points cluster together, while a few
points lie far away toward the top right corner of the graph. Those points in the top right
corner can be regarded as outliers.

As an approximation, one can say that all data points with bmi > 0.12 (combined with
bp < 0.8) are outliers. The following code fetches the exact positions of all points that
satisfy these conditions.

Removal of Outliers in BMI and BP Column Combined


Here, NumPy’s np.where() function is used to find the positions (indices) where the condition
(df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8) is true in the DataFrame df_diabetics.
The condition checks for outliers where ‘bmi’ is greater than 0.12 and ‘bp’ is less than 0.8.
The output provides the row and column indices of the outlier positions in the DataFrame.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

outlier_indices = np.where((df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8))
no_outliers = df_diabetics.drop(outlier_indices[0])

# Scatter plot without outliers
fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers['bmi'], no_outliers['bp'])
ax_no_outliers.set_xlabel('(body mass index of people)')
ax_no_outliers.set_ylabel('(bp of the people)')
plt.show()
Output:

Scatter plot

The outliers have been removed successfully.

Z-score
A Z-score is also called a standard score. This value/score helps us understand how far a
data point is from the mean. After setting up a threshold value, one can use the z-scores of
data points to define the outliers.

Z-score = (data_point - mean) / standard deviation

In this example, we are calculating the Z scores for the ‘age’ column in the DataFrame
df_diabetics using the zscore function from the SciPy stats module. The resulting array z
contains the absolute Z scores for each data point in the ‘age’ column, indicating how many
standard deviations each value is from the mean.

from scipy import stats
import numpy as np

z = np.abs(stats.zscore(df_diabetics['age']))
print(z)
Output:

0 0.800500
1 0.039567
2 1.793307
3 1.872441
4 0.113172
...
437 0.876870
438 0.115937
439 0.876870
440 0.956004
441 0.956004
Name: age, Length: 442, dtype: float64
Now, to define an outlier, a threshold value is chosen, generally 3.0, since 99.7% of the
data points lie within +/- 3 standard deviations (using the Gaussian distribution approach).

Removal of Outliers with Z-Score


Let’s remove rows where Z value is greater than 2.

In this example, we set a threshold value of 2 and then use NumPy's np.where() to identify
the positions (indices) in the Z-score array z where the absolute Z-score is greater than the
specified threshold (2). It prints the positions of the outliers in the 'age' column based on the
Z-score criterion.

import numpy as np

threshold_z = 2

outlier_indices = np.where(z > threshold_z)[0]
no_outliers = df_diabetics.drop(outlier_indices)
print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)
Output:

Original DataFrame Shape: (442, 10)
DataFrame Shape after Removing Outliers: (426, 10)
IQR (Inter Quartile Range)
The Inter Quartile Range (IQR) approach to finding outliers is the most commonly used and
most trusted approach in the research field.

IQR = Quartile3 – Quartile1

Syntax: numpy.percentile(arr, n, axis=None, out=None)

Parameters:
arr : input array.
n : percentile value.
In this example, we are calculating the interquartile range (IQR) for the ‘bmi’ column in the
DataFrame df_diabetics. It first computes the first quartile (Q1) and third quartile (Q3) using
the midpoint method, then calculates the IQR as the difference between Q3 and Q1,
providing a measure of the spread of the middle 50% of the data in the ‘bmi’ column.

# IQR
Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')
Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)
Output

0.06520763046978838
To define outliers, base values are defined above and below the dataset's normal range,
namely the Upper and Lower bounds (a margin of 1.5*IQR is considered):

upper = Q3 + 1.5*IQR
lower = Q1 - 1.5*IQR

In the above formula, according to statistics, the 0.5 scale-up of IQR (new_IQR = IQR +
0.5*IQR) is taken so as to consider all the data within about 2.7 standard deviations in the
Gaussian distribution.

# Above Upper bound
upper = Q3 + 1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())

# Below Lower bound
lower = Q1 - 1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())
Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0
Outlier Removal in Dataset using IQR
In this example, we are using the interquartile range (IQR) method to detect and remove
outliers in the ‘bmi’ column of the diabetes dataset. It calculates the upper and lower limits
based on the IQR, identifies outlier indices using Boolean arrays, and then removes the
corresponding rows from the DataFrame, resulting in a new DataFrame with outliers
excluded. The before and after shapes of the DataFrame are printed for comparison.

# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# Load the dataset
diabetes = load_diabetes()

# Create the dataframe
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data)
df_diabetes.columns = column_name
df_diabetes.head()
print("Old Shape: ", df_diabetes.shape)

''' Detection '''

# IQR
# Calculate the upper and lower limits
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR

# Create arrays of indices indicating the outlier rows
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]

# Removing the outliers
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)

# Print the new shape of the DataFrame
print("New Shape: ", df_diabetes.shape)
Output:

Old Shape: (442, 10)
New Shape: (439, 10)

Conclusion
In conclusion, visualization tools like box plots and scatter plots aid in identifying outliers,
and mathematical methods such as Z-scores and the Inter Quartile Range (IQR) offer robust
approaches.

Frequently Asked Questions on Outlier Removal


Q. What is removing outliers in machine learning?
Removing outliers involves excluding data points significantly deviating from the norm to
enhance model accuracy and generalization on new data.

Q. What are the techniques to remove outliers?


Common techniques include visualization tools (box plots, scatter plots), mathematical
methods (Z-scores, IQR), and threshold-based filtering.

Q. What is the mean if the outlier is removed?


Removing outliers influences the mean, reducing its sensitivity to extreme values and
providing a more representative measure of central tendency.

Q. Why remove outliers from data?


Outliers can distort statistical analyses, affecting mean, variance, and other measures.
Removal improves model performance and data accuracy.

Q. What are different types of outliers in machine learning?


Outliers include global outliers (which deviate from the entire dataset) and local outliers
(anomalous within specific subgroups), influencing data integrity.

N-Dimensional Array (ndarray) in Numpy


An array in Numpy is a table of elements (usually numbers), all of the same type, indexed by
a tuple of positive integers. In Numpy, the number of dimensions of the array is called the
rank of the array. A tuple of integers giving the size of the array along each dimension is
known as the shape of the array. The array class in Numpy is called ndarray. Elements in
Numpy arrays are accessed by using square brackets and can be initialized by using nested
Python lists.
Example :

[[ 1, 2, 3],
[ 4, 2, 5]]

Here, rank = 2 (as it is 2-dimensional, i.e. it has 2 axes).
First dimension (axis) has length = 2, the second dimension has length = 3,
so the overall shape can be expressed as: (2, 3).
# Python program to demonstrate
# basic array characteristics
import numpy as np

# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])

# Printing type of arr object
print("Array is of type: ", type(arr))

# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)

# Printing shape of array
print("Shape of array: ", arr.shape)

# Printing size (total number of elements) of array
print("Size of array: ", arr.size)

# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
Output :

Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64

Working with Missing Data in Pandas


Missing data can occur when no information is provided for one or more items or for a
whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also
referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing
data, either because it exists and was not collected or because it never existed. For example,
suppose different users being surveyed may choose not to share their income, and some users
may choose not to share their address; in this way many datasets end up with missing values.

In Pandas missing data is represented by two values:

None: None is a Python singleton object that is often used for missing data in Python code.
NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by
all systems that use the standard IEEE floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null
values. To facilitate this convention, there are several useful functions for detecting,
removing, and replacing null values in a Pandas DataFrame:

isnull()
notnull()
dropna()
fillna()
replace()
interpolate()
The examples in this article use a CSV file (employees.csv).

Checking for missing values using isnull() and notnull()


In order to check missing values in a Pandas DataFrame, we use the functions isnull() and
notnull(). Both functions help in checking whether a value is NaN or not. These functions can
also be used on a Pandas Series in order to find null values in a series.

Checking for missing values using isnull()


In order to check null values in a Pandas DataFrame, we use the isnull() function. This
function returns a dataframe of Boolean values which are True for NaN values.

Code #1:

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# using isnull() function
df.isnull()
Output:
Code #2:

Python
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])

# filtering data
# displaying data only with Gender = NaN
data[bool_series]
Output: As shown in the output image, only the rows having Gender = NULL are displayed.

Checking for missing values using notnull()


In order to check null values in a Pandas DataFrame, we use the notnull() function. This
function returns a dataframe of Boolean values which are False for NaN values.

Code #3:

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe using dictionary
df = pd.DataFrame(dict)

# using notnull() function
df.notnull()
Output:

Code #4:

Python
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# creating bool series True for not-NaN values
bool_series = pd.notnull(data["Gender"])

# filtering data
# displaying data only with Gender = Not NaN
data[bool_series]
Output: As shown in the output image, only the rows having Gender = NOT NULL are
displayed.

Filling missing values using fillna(), replace() and interpolate()


In order to fill null values in a dataset, we use the fillna(), replace() and interpolate()
functions; these functions replace NaN values with some value of their own. All these
functions help in filling null values in the datasets of a DataFrame. The interpolate() function
is basically used to fill NA values in the dataframe, but it uses various interpolation
techniques to fill the missing values rather than hard-coding a value.

Code #1: Filling null values with a single value

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# filling missing value using fillna()
df.fillna(0)
Output:

Code #2: Filling null values with the previous ones

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# filling a missing value with previous ones
# (newer pandas versions prefer df.ffill())
df.fillna(method='pad')

Code #3: Filling null value with the next ones

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# filling null value with the next ones
# (newer pandas versions prefer df.bfill())
df.fillna(method='bfill')
Output:

Code #4: Filling null values in CSV File

Python
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of
# the data frame for visualization
data[10:25]
Output

Now we are going to fill all the null values in the Gender column with “No Gender”

Python
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# filling null values using fillna()
data["Gender"].fillna("No Gender", inplace=True)

data
Output:

Code #5: Filling a null values using replace() method

Python
# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of
# the data frame for visualization
data[10:25]
Output:

Now we are going to replace all the NaN values in the data frame with the value -99.

Python
# importing pandas package
import pandas as pd

# importing numpy package (needed for np.nan)
import numpy as np

# making data frame from csv file
data = pd.read_csv("employees.csv")

# will replace NaN values in the dataframe with value -99
data.replace(to_replace=np.nan, value=-99)
Output:
Code #6: Using interpolate() function to fill the missing values using linear method.

Python
# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Print the dataframe
df
Output:

Let's interpolate the missing values using the Linear method. Note that the Linear method
ignores the index and treats the values as equally spaced.

Python
# to interpolate the missing values
df.interpolate(method ='linear', limit_direction ='forward')
Output:

As we can see in the output, values in the first row could not be filled, as the direction of
filling is forward and there is no previous value which could have been used in
interpolation.

Dropping missing values using dropna()


In order to drop null values from a dataframe, we use the dropna() function. This function
drops rows/columns of the dataset with null values in different ways.

Code #1: Dropping rows with at least 1 null value.

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}
# creating a dataframe from dictionary
df = pd.DataFrame(dict)

df
Output

Now we drop rows with at least one Nan value (Null value)

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# using dropna() function
df.dropna()
Output:
Code #2: Dropping rows if all values in that row are missing.

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

df
Output

Now we drop the rows in which all the data is missing, i.e. all values are null (NaN)
Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(dict)

# using dropna() function
df.dropna(how='all')
Output:

Code #3: Dropping columns with at least 1 null value.

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

df
Output

Now we drop the columns which have at least 1 missing value

Python
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(dict)

# using dropna() function
df.dropna(axis=1)
Output :

Code #4: Dropping Rows with at least 1 null value in CSV file

Python
# importing pandas module
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')

new_data
Output:

Now we compare sizes of data frames so that we can come to know how many rows had at
least 1 Null value

Python
print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))
Output :

Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value: 236
Since the difference is 236, there were 236 rows that had at least one null value in some column.
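
Beyond how and axis, dropna() also accepts a thresh parameter (keep only rows with at least that many non-null values) and a subset parameter (consider only the listed columns when deciding what to drop). A short sketch using the scores DataFrame from above; the threshold of 3 and the chosen column are just illustrative:

Python
import pandas as pd
import numpy as np

scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(scores)

# keep only rows that have at least 3 non-null values
print(df.dropna(thresh=3))

# drop rows only when 'Second Score' is null
print(df.dropna(subset=['Second Score']))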

Data science
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, scientific visualization, algorithms, and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.

Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.

Data science is "a concept to unify statistics, data analysis, informatics, and their related
methods" to "understand and analyze actual phenomena" with data.It uses techniques and
theories drawn from many fields within the context of mathematics, statistics, computer
science, information science, and domain knowledge. However, data science is different from
computer science and information science. Turing Award winner Jim Gray imagined data
science as a "fourth paradigm" of science (empirical, theoretical, computational, and now
data-driven) and asserted that "everything about science is changing because of the impact of
information technology" and the data deluge.

A data scientist is a professional who creates programming code and combines it with
statistical knowledge to create insights from data.
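
As a small illustration of combining code with statistics, a data scientist might start by summarising a dataset with descriptive statistics. A minimal sketch; the salary figures here are made up for the example:

Python
import pandas as pd

# hypothetical salary data
salaries = pd.Series([42000, 55000, 61000, 48000, 75000])

# count, mean, standard deviation, min, quartiles, and max in one call
print(salaries.describe())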

Python Modules
A Python module is a file that contains function, class, and variable definitions. There are many Python modules, each with its own specific purpose.

In this section, we will cover the essentials of Python modules: how to create a simple module of our own, how to import Python modules, how the from statement works, and how to rename a module using an alias.

What is a Python Module?

A Python module is a file containing Python definitions and statements. A module can define
functions, classes, and variables. A module can also include runnable code.

Grouping related code into a module makes the code easier to understand and use. It also
makes the code logically organized.

Create a Python Module

To create a Python module, write the desired code and save it in a file with a .py extension. Let's understand this better with an example:

Example:

Let's create a simple module calc.py in which we define two functions, one to add and another to subtract.

# A simple module, calc.py

def add(x, y):
    return (x + y)

def subtract(x, y):
    return (x - y)
Import module in Python
We can import the functions and classes defined in a module into another module by using the import statement in some other Python source file.

When the interpreter encounters an import statement, it imports the module if the module is
present in the search path.

Note: A search path is a list of directories that the interpreter searches for importing a
module.

For example, to import the module calc.py, we need to put the following command at the top
of the script.

Syntax to Import a Module in Python

import module

Note: This does not import the functions or classes directly; instead, it imports the module only. To access the functions inside the module, the dot (.) operator is used.

Importing modules in Python Example

Now we import the calc module that we created earlier and use it to perform the add operation.

# importing module calc.py
import calc

print(calc.add(10, 2))
Output:

12

Python Import From Module

Python’s from statement lets you import specific attributes from a module without importing
the module as a whole.

Import Specific Attributes from a Python Module

Here, we import only the sqrt and factorial attributes from the math module.

# importing sqrt() and factorial() from the math module
from math import sqrt, factorial

# if we simply did "import math", then
# math.sqrt(16) and math.factorial(6) would be required
print(sqrt(16))
print(factorial(6))
Output:

4.0
720
Import all Names
The * symbol used with the import statement imports all the names from a module into the current namespace.

Syntax:

from module_name import *


What does import * do in Python?
Using * has its advantages and disadvantages: it saves typing, but it can clutter the namespace and shadow existing names. If you know exactly which names you need from the module, it is better to import them explicitly.

# importing everything from the math module
from math import *

# sqrt() and factorial() can now be used
# without the math. prefix
print(sqrt(16))
print(factorial(6))
Output

4.0
720
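
One concrete risk of import * is name shadowing. For example, the math module defines its own pow(), which silently replaces the built-in pow() after a star import; a small sketch:

Python
# after this import, pow refers to math.pow,
# not the built-in pow
from math import *

# math.pow always returns a float
print(pow(2, 3))  # 8.0, not 8

# the built-in three-argument form pow(2, 3, 5)
# would now raise a TypeError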
Locating Python Modules
Whenever a module is imported in Python, the interpreter looks in several locations. First, it checks for a built-in module; if the module is not found there, it looks through the list of directories defined in sys.path. The Python interpreter searches for the module in the following manner:

First, it searches for the module in the current directory.
If the module isn't found in the current directory, Python then searches each directory in the shell variable PYTHONPATH. PYTHONPATH is an environment variable consisting of a list of directories.
If that also fails, Python checks the installation-dependent list of directories configured at the time Python is installed.

Directories List for Modules
Here, sys.path is a built-in variable within the sys module. It contains a list of directories that
the interpreter will search for the required module.

# importing the sys module
import sys

# printing the module search path
print(sys.path)
Output:

['/home/nikhil/Desktop/gfg', '/usr/lib/python38.zip', '/usr/lib/python3.8',
'/usr/lib/python3.8/lib-dynload', '', '/home/nikhil/.local/lib/python3.8/site-packages',
'/usr/local/lib/python3.8/dist-packages', '/usr/lib/python3/dist-packages',
'/usr/local/lib/python3.8/dist-packages/IPython/extensions', '/home/nikhil/.ipython']
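
Since sys.path is an ordinary Python list, you can append a directory to it at runtime so that modules stored there become importable. A minimal sketch; the directory path is hypothetical:

Python
import sys

# make modules stored in /home/nikhil/my_modules importable
# (hypothetical directory; adjust to your own layout)
sys.path.append("/home/nikhil/my_modules")

# a subsequent "import calc" would now also search that directory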

Renaming the Python Module

We can rename a module while importing it using the as keyword.

Syntax: import module_name as alias_name

# importing the math module under the alias mt
import math as mt

# functions are accessed through the alias
print(mt.sqrt(16))
print(mt.factorial(6))
Output
4.0
720
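
In practice, widely used libraries are almost always imported under conventional aliases, the same pattern used throughout this document:

Python
# conventional aliases for common data science libraries
import numpy as np
import pandas as pd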

Python Built-in Modules

There are several built-in modules in Python, which you can import whenever you like.

# importing built-in module math
import math

# using the square root (sqrt) function from the math module
print(math.sqrt(25))

# using the constant pi from the math module
print(math.pi)

# 2 radians = 114.59 degrees
print(math.degrees(2))

# 60 degrees = 1.04 radians
print(math.radians(60))

# sine of 2 radians
print(math.sin(2))

# cosine of 0.5 radians
print(math.cos(0.5))

# tangent of 0.23 radians
print(math.tan(0.23))

# 1 * 2 * 3 * 4 = 24
print(math.factorial(4))

# importing built-in module random
import random

# printing a random integer between 0 and 5
print(random.randint(0, 5))

# printing a random floating-point number between 0 and 1
print(random.random())

# random number between 0 and 100
print(random.random() * 100)

List = [1, 4, True, 800, "python", 27, "hello"]

# using the choice function from the random module
# to pick a random element from a sequence such as a list
print(random.choice(List))

# importing built-in module datetime
import datetime
from datetime import date
import time

# returns the number of seconds since the
# Unix epoch, January 1st, 1970
print(time.time())

# converts a number of seconds to a date object
print(date.fromtimestamp(454554))
Output:

5.0
3.14159265359
114.591559026
1.0471975512
0.909297426826
0.87758256189
0.234143362351
24
3
0.401533172951
88.4917616788
True
1461425771.87
1970-01-06
