Pandas Interview Questions


For an aspiring data scientist, the most important data analytics libraries to be well-versed with are NumPy and Pandas. We have already covered NumPy Interview Questions for Data Scientists. This article will help you brush up on your foundations of the Pandas library and prepare you for the most common Pandas Interview Questions that come up during Data Science interviews.

We will be covering the following sections today:

 Quick Intro to Pandas
 Pandas Interview Questions

Quick Intro to Pandas

Pandas is an open-source Python library and the most popular package for data pre-processing in
Data Science. It offers extensive features and powerful built-in functions that help with tasks such
as data cleaning, data manipulation, data modeling, and analysis.

Let’s discuss the common Pandas-related questions you might come across during your interviews.

Pandas Interview Questions


What is meant by the mutability of objects?

Mutability refers to the ability to modify a portion of a data structure without needing to recreate it. Examples of mutable objects are lists, sets, and dictionaries.

Similarly, objects are said to be immutable if they cannot be modified once created. Examples of immutable objects are integers, floats, strings, and tuples; dictionary keys must also be of an immutable (hashable) type.

1. Discuss Pandas Data Structures.


Data structures are specialized ways of organizing, processing, and storing data so that operations can be performed on them efficiently.

Pandas offers two main data structures:

Pandas Series:

 1D labeled array.
 Homogeneous, i.e., the data types of the Series elements are all the same.
 Size-immutable, i.e., the size of a Series object cannot be changed once created.

#pandas.Series

s = pd.Series(data=None, index=None, dtype=None, name=None)

Index   Data
0       element 1
1       element 2
2       element 3
3       element 4
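For example, a Series can be created directly from a Python list; the labels then default to the integer positions 0, 1, 2, and so on:

import pandas as pd

#Creating a Series from a list
s = pd.Series(['element 1', 'element 2', 'element 3', 'element 4'])

s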

Pandas DataFrames:

 2D labeled tabular structure.
 Heterogeneous, i.e., the DataFrame columns can have different data types.
 Size-mutable, i.e., rows and columns can be added to or dropped from an existing DataFrame.

#pandas.DataFrame

df = pd.DataFrame(data=None, index=None, columns=None, dtype=None)

Index   Column 1    Column 2
0       element 1   element a
1       element 2   element b
2       element 3   element c
3       element 4   element d

import pandas as pd

#Creating a sample DataFrame

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})

df

2. How do you get the column names of your DataFrame?


import pandas as pd

#Creating a sample DataFrame

df = pd.DataFrame({
    'id': [101, 129, 120, 126, 136, 125],
    'age': [30, 32, 43, 45, 29, 46],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F']
})
df

Now, to get the column names of the DataFrame, we can try multiple
ways –

 Iterating over columns and printing them:

for col in df.columns:

print(col)

 Using the columns attribute, which holds the column labels of the DataFrame:

list(df.columns)

 Using the columns.values attribute, which returns the column labels as an array:

list(df.columns.values)

 Using the built-in sorted() function, which returns the column names sorted in alphabetical order:

sorted(df)
3. How do you get the number of rows and columns of a Pandas DataFrame?

You can find this simply through the shape attribute, which returns a tuple of the form (number of rows, number of columns):

df.shape

Thus, there are 6 rows and 3 columns in the DataFrame df.
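If you only need one of the two counts, you can index the shape tuple or use the built-in len() function, which returns the number of rows:

#Number of rows
df.shape[0]
len(df)

#Number of columns
df.shape[1]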

4. How can you slice a Pandas DataFrame?

Slicing simply means extracting a subset of the DataFrame. To do this, you can just pass the relevant
columns into the slicer as shown:

df[['age','id']]
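You can also slice rows. As a small sketch using the sample df from above, iloc slices rows by integer position, while loc selects them by index label:

#First three rows, selected by position
df.iloc[0:3]

#Rows labelled 0 to 2 (inclusive) and only the 'age' column
df.loc[0:2, ['age']]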

5. How do you convert a Pandas Series to a DataFrame?

You can use the Series.to_frame() method to convert the series object to a DataFrame:

series = pd.Series(['butter','dynamite','daydream', 'friends'])

col = 'bts'

series.to_frame(name=col)
6. How do you convert a String to DateTime?

You can use the datetime.strptime() method to convert a string to a datetime object:

from datetime import datetime

#Define dates in string format

str1 = 'Wednesday, April 20, 2022'

str2 = '4/20/22'

str3 = '04-20-2022'

#Define dates as datetime objects

date1 = datetime.strptime(str1, '%A, %B %d, %Y')

date2 = datetime.strptime(str2, '%m/%d/%y')

date3 = datetime.strptime(str3, '%m-%d-%Y')

#Print the converted dates

print(date1)

print(date2)

print(date3)
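Pandas also provides its own parser, pd.to_datetime(), which returns Timestamp objects and can often infer the format automatically; a minimal sketch reusing the strings defined above:

#Using pandas' built-in parser
print(pd.to_datetime(str2, format='%m/%d/%y'))
print(pd.to_datetime(str3))   #format inferred automatically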

7. How do you add a single row to an existing DataFrame?

 Using a Series:

#Appending a single row with a Series

series = pd.Series([272, 22, 'M'],
                   index=['id', 'age', 'gender'])

df.append(series, ignore_index=True)

 Using a Dictionary:

Please note that in this case, we would match the dictionary keys with the DataFrame columns:

#Appending a single row with a dictionary

row = {'id': 169, 'age': 54, 'gender': 'M'}

df.append(row, ignore_index=True)

8. How do you add multiple rows to an existing DataFrame?

 Using a DataFrame:

#Creating a new DataFrame that we want to append

df2 = pd.DataFrame({

'id': [1221, 1200, 1230],

'age': [36,53,26],

'gender': ['M','F','M'],

})

#Appending multiple rows with DataFrame

df.append(df2)
 Using a List of Dictionaries:

#Creating a list of dictionaries
mydicts = [
    {'id': 1434, 'age': 26, 'gender': 'M'},
    {'id': 1578, 'age': 34, 'gender': 'M'},
    {'id': 1301, 'age': 23, 'gender': 'F'},
    {'id': 1207, 'age': 54, 'gender': 'M'}
]

#Appending multiple rows with a list of dictionaries
df.append(mydicts, ignore_index=True)

Here the dictionary keys are matched to the DataFrame columns, so no separate conversion step is needed.
 Using a List of Lists:

#Creating a list of lists
mylist = [
    [1434, 26, 'M'],
    [1578, 34, 'M'],
    [1301, 23, 'F'],
    [1207, 54, 'M']
]

#Converting the list of lists to a DataFrame
df2 = pd.DataFrame(mylist, columns=df.columns)

#Appending multiple rows with a list of lists
df.append(df2, ignore_index=True)
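Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, the same result is obtained with pd.concat(); a minimal sketch reusing df2 from above:

#Concatenating instead of appending
pd.concat([df, df2], ignore_index=True)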

9. How can you find the sum of values of a DataFrame column?


You can use the sum() method to return the sum of the values. This is done with the help of the following parameters:

 axis: specifies whether to sum along the index (0), i.e., down each column, or along the columns (1), i.e., across each row.
 skipna: is set to True to skip all missing values.

df.sum(axis = 0, skipna = True)


Note: If you sum along the index axis (axis=0), all the values in a column are added, and the same is repeated for every column. This returns a Series containing the sum of the values in each column.
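To sum a single column, you can call sum() on that column directly; for the sample df created earlier:

#Sum of the 'age' column
df['age'].sum()   #returns 225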

10. How and why should one create a copy of a DataFrame?


You can simply use the copy() method to copy your existing DataFrame:

df_copy = df.copy()

During data manipulation, it is generally a safe practice to perform operations on the copy of the
DataFrame rather than the original one. Even if you work with a subset of the DataFrame, any
changes made to the subset would reflect on the original DataFrame because in Pandas, indexing a
DataFrame returns a reference to the initial DataFrame. Thus, changing the subset will change the
initial DataFrame.

So, to ensure that the initial DataFrame remains intact, you’d want to use copies and merge them all
at the end.

What are loc and iloc methods in Pandas?

The loc[ ] method

It is a label-based method that is used for selecting data and updating it. You can do so by passing
the row or column label you want to select.

Syntax:

loc[row_label, column_label]

The iloc[ ] method

The iloc[ ] is an index-based method that is used for data selection. Here, you can pass the row or
column positions that you want to select (0-based integer index).

Syntax: iloc[row_position, column_position]

Learn more about the difference between loc and iloc in Pandas.
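As a quick illustration on the sample df used earlier (which has a default integer index):

#loc: label-based selection
df.loc[0, 'age']            #row labelled 0, column 'age'
df.loc[:, ['id', 'age']]    #all rows, two columns selected by name

#iloc: position-based selection
df.iloc[0, 1]               #first row, second column
df.iloc[0:2, 0:2]           #first two rows and first two columns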
11. How do you sort a DataFrame?
You can use the sort_values() method to sort a DataFrame by its rows or columns. This can be done with the help of the following parameters:

 by: specifies the column/row label(s) that determine the sorting order.
 axis: specifies whether to sort along the rows (0) or the columns (1).
 ascending: by default, a DataFrame is sorted in ascending order. You can specify ascending=False to sort in descending order.

#Sort by a single column

df.sort_values(by=['age'])

#Sort by multiple columns

df.sort_values(by=['age', 'id'])
#Sort in descending order

df.sort_values(by='age', ascending=False)

12. How can you merge two DataFrames?


You can use the merge() method to combine your DataFrames with SQL-style joins. There are four types of joins that you can pass to the optional how argument:

 inner (default): output DataFrame consists of records that have the same values in both
DataFrames:
#Creating two dataframes

data1 = pd.DataFrame({"Roll No.": [101, 103, 104, 107, 106],

"Gender": [ 'M', 'F', 'F', 'M', 'M'],

"Name": ['Rob', 'Pam', 'Max', 'Tom', 'Ben']

})

data2 = pd.DataFrame({"Roll No.": [101, 102, 103, 104, 105],

"Age": [31, 35, 36, 30, 38]

})

#Inner merge

df_merge = pd.merge(left=data1, right=data2, how='inner')

df_merge

 outer: output DataFrame consists of all records with matching values in either the left or right
DataFrame. If rows match, values are shown. Else, NaN is displayed:
#Outer merge

df_merge = pd.merge(left=data1, right=data2, how='outer')

df_merge

 left: output DataFrame consists of all records from the left DataFrame and only the matched
records from the right DataFrame:

#Left join

df_merge = pd.merge(left=data1, right=data2, how='left')

df_merge
 right: output DataFrame consists of all records from the right DataFrame and only the
matched records from the left DataFrame:

#Right join

df_merge = pd.merge(left=data1, right=data2, how='right')

df_merge
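By default, merge() joins on the column(s) common to both DataFrames, which here is "Roll No.". You can make the join key explicit with the on parameter:

#Explicitly specifying the join column
df_merge = pd.merge(left=data1, right=data2, how='inner', on='Roll No.')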

13. How do you deal with a dataset consisting of missing values?


For larger datasets, the quickest way to deal with missing values is usually to drop the rows that contain them (for example with dropna()), since losing a small fraction of rows rarely has a noticeable effect on the analysis.
For smaller datasets, you can substitute the missing values with a statistic such as the column mean, or with zeros, by performing operations on the DataFrame. For this, you can use methods such as mean() and fillna():

df_merge.fillna(0)
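A minimal sketch of both approaches on the merged DataFrame from the previous question (here only the numeric 'Age' column gets filled with its mean):

#Drop rows that contain any missing value
df_merge.dropna()

#Fill missing values in numeric columns with the column mean
df_merge.fillna(df_merge.mean(numeric_only=True))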

14. What is the difference between the fillna() and interpolate() methods?

The fillna() method fills the missing or NaN values with a substitute value specified by you. You
have the option to fill according to the row indices of a DataFrame or on the column names in the
form of a Python dictionary.

On the other hand, the interpolate() method allows you to fill your missing or NaN values with many
kinds of interpolations between the values like linear, time, etc.
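A small sketch of the difference on a toy Series with one missing value:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

#fillna substitutes a fixed value
s.fillna(0)         #1.0, 0.0, 3.0

#interpolate estimates the missing value from its neighbours (linear by default)
s.interpolate()     #1.0, 2.0, 3.0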

15. How to group DataFrame rows in a list?


You can use the groupby() method to group on the specified column and then use the apply() method
to apply a list to every group:

df1 = df.groupby('age')['gender'].apply(list).reset_index(name='new')

df1
16. How do you delete a column or a row from a Pandas
DataFrame?
You can use the drop() method to delete a row or a column from the DataFrame. This can be done with the help of the following parameters:

 axis: specifies whether to drop from the rows (0) or the columns (1).
 inplace: can be set to True to delete a column without reassigning the DataFrame.
 columns / index: alternatively, you can pass the labels to drop directly to the columns or index parameter instead of using axis.

#Dropping a single column

df.drop(['gender'], axis = 1)
#Dropping multiple columns

df.drop(['gender', 'id'], axis = 1)

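You can drop rows in the same way by passing index labels with axis=0; for the sample df above:

#Dropping rows by index label
df.drop([0, 1], axis = 0)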

Note: You can also remove duplicate values from the column by using the drop_duplicates()
method.

17. How can you encode a large Pandas DataFrame?


For a DataFrame with a lot of columns (aka features), manually encoding each feature is not efficient. Instead, you can use the encoders from Python's scikit-learn library, such as LabelEncoder for a single column of string or numerical labels, or OneHotEncoder for one-hot encoding the categorical columns of a DataFrame:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder().fit_transform(df)
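For instance, to label-encode just the 'gender' column of the sample df used earlier (a small sketch):

from sklearn.preprocessing import LabelEncoder

#'F'/'M' are replaced by the integer codes 0/1
df['gender'] = LabelEncoder().fit_transform(df['gender'])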

18. How can you split your Pandas DataFrame into training and
testing sets?
You can use Python’s scikit-learn library to split both NumPy arrays and Pandas DataFrames into
training and testing sets to be used for creating ML models. For this, you can import train_test_split
to perform the split:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

The above code splits the DataFrame into 20% testing set and 80% training set.
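For reproducible splits, you can also pass a fixed seed, e.g. train_test_split(df, test_size=0.2, random_state=42).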
