Pandas Interview Questions
Two of the libraries that Data Scientists work with most are NumPy and Pandas. We have already covered NumPy Interview Questions for Data
Scientists. This article will help you brush up on your foundations of the Pandas library and prepare
you for the most common Pandas Interview Questions that come up during Data Science interviews.
Pandas is an open-source Python library and the most popular package for data pre-processing in
Data Science. It offers extensive features and powerful built-in functions that help with tasks such
as data cleaning, data manipulation, data modeling, and analysis.
Let’s discuss the common Pandas-related questions you might come across during your interviews.
Mutability refers to the ability to modify a portion of a data structure without needing to recreate it.
Examples of mutable objects are lists, sets, and values in a dictionary.
Similarly, objects are said to be immutable if they cannot be modified once created. Examples of
immutable objects are integers, floats, strings, tuples, and keys of a dictionary.
Pandas Series:
1D labeled array.
Homogeneous, i.e., all elements of a series share the same data type.
Size-immutable, i.e., the size of a series object cannot be changed once it is created.
#pandas.DataFrame
Index  Data
0      element 1
1      element 2
2      element 3
3      element 4
Pandas DataFrames:
1 element 2 element b
2 element 3 element c
3 element 4 element d
import pandas as pd
df = pd.DataFrame({
})
df
df = pd.DataFrame({
})
df
Now, to get the column names of the DataFrame, we can try multiple ways:
Using a simple iteration over the DataFrame, which yields its column labels:
for col in df:
    print(col)
Using the columns attribute, which returns the column labels of the DataFrame:
list(df.columns)
list(df.columns.values)
Using the built-in sorted() function, which returns the list of column names sorted in alphabetical order:
sorted(df)
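The approaches above can be compared side by side in a small sketch. The sample data is hypothetical; only the column names (id, age, gender) are taken from the snippets in this article.

```python
import pandas as pd

# Hypothetical sample data; column names mirror those used in this article
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [31, 45, 27],
    "gender": ["F", "M", "F"],
})

cols_by_iteration = [col for col in df]     # iterating a DataFrame yields its column labels
cols_from_attr = list(df.columns)           # Index object converted to a list
cols_from_values = list(df.columns.values)  # underlying NumPy array converted to a list
cols_sorted = sorted(df)                    # column labels in alphabetical order

print(cols_from_attr)
print(cols_sorted)
```

The first three variants preserve the original column order; only sorted() reorders the labels.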
3. How do you get the number of rows and columns of a Pandas DataFrame?
You can find this through the shape attribute, which returns a (number of rows, number of columns) tuple:
df.shape
Slicing simply means extracting a subset of the DataFrame. To select a subset of columns, pass a
list of the relevant column names inside the indexing brackets as shown:
df[['age','id']]
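As a rough illustration of column selection, the sketch below builds a small DataFrame with made-up values and checks what the selection returns. The data is an assumption; the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [31, 45, 27],
    "gender": ["F", "M", "F"],
})

# A list of column names returns a new DataFrame with those columns, in the order given
subset = df[["age", "id"]]

# A plain slice inside the brackets selects rows by position (end is exclusive)
first_two_rows = df[0:2]

print(subset.shape)
print(first_two_rows.shape)
```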
You can use the Series.to_frame() method to convert the series object to a DataFrame:
col = 'bts'
series.to_frame(name=col)
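A self-contained version of this conversion might look as follows. The series values are invented for illustration; the column name is the one used above.

```python
import pandas as pd

# Hypothetical series; the values are for illustration only
series = pd.Series([10, 20, 30])

col = "bts"
frame = series.to_frame(name=col)  # one-column DataFrame that keeps the series index

print(type(frame).__name__)
print(list(frame.columns))
```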
6. How do you convert a String to DateTime?
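You can use the datetime.strptime() method to convert a string to a DateTime object by passing a
format that matches the string:
from datetime import datetime
str2 = '4/20/22'
str3 = '04-20-2022'
# %y matches a two-digit year, %Y a four-digit year
date2 = datetime.strptime(str2, '%m/%d/%y')
date3 = datetime.strptime(str3, '%m-%d-%Y')
print(date2)
print(date3)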
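When the strings live in a Pandas Series or DataFrame column, pd.to_datetime() accepts the same format codes and converts the whole column at once. The sketch below assumes a hypothetical series of month/day/two-digit-year strings.

```python
import pandas as pd

# Hypothetical date strings in month/day/two-digit-year form
dates = pd.Series(["4/20/22", "12/1/22"])

# Same format codes as datetime.strptime(), applied to every element
parsed = pd.to_datetime(dates, format="%m/%d/%y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```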
Using a Series:
# DataFrame.append() was removed in pandas 2.0; pd.concat() is its replacement
pd.concat([df, series.to_frame().T], ignore_index=True)
Using a Dictionary:
Note that the dictionary keys must match the DataFrame column names:
pd.concat([df, pd.DataFrame([row_dict])], ignore_index=True)
Using a DataFrame:
df2 = pd.DataFrame({
'age': [36,53,26],
'gender': ['M','F','M'],
})
pd.concat([df, df2])
Using a List of Dictionaries:
mylist = [
pd.concat([df, pd.DataFrame(mylist)], ignore_index=True)
Using a List of Lists:
mylist = [
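Because DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, the row-adding approaches listed above can be sketched with pd.concat() instead. All data and variable names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical starting DataFrame
df = pd.DataFrame({"id": [1, 2], "age": [31, 45], "gender": ["F", "M"]})

# From a Series: turn it into a one-row DataFrame first
row_series = pd.Series({"id": 3, "age": 27, "gender": "F"})
from_series = pd.concat([df, row_series.to_frame().T], ignore_index=True)

# From a dictionary: the keys must match the column names
row_dict = {"id": 4, "age": 52, "gender": "M"}
from_dict = pd.concat([df, pd.DataFrame([row_dict])], ignore_index=True)

# From another DataFrame
df2 = pd.DataFrame({"id": [5], "age": [36], "gender": ["M"]})
from_frame = pd.concat([df, df2], ignore_index=True)

# From a list of dictionaries
row_dicts = [{"id": 6, "age": 53, "gender": "F"}, {"id": 7, "age": 26, "gender": "M"}]
from_dicts = pd.concat([df, pd.DataFrame(row_dicts)], ignore_index=True)

# From a list of lists: column names have to be supplied explicitly
row_lists = [[8, 33, "F"], [9, 29, "M"]]
from_lists = pd.concat([df, pd.DataFrame(row_lists, columns=df.columns)], ignore_index=True)

print(from_lists)
```

Passing ignore_index=True renumbers the rows from 0; without it, the appended rows keep their own index labels.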
df.copy()
During data manipulation, it is generally safe practice to perform operations on a copy of the
DataFrame rather than on the original. When you work with a subset of the DataFrame, changes
made to the subset can propagate to the original, because indexing a DataFrame may return a view
of the original data instead of a copy. Whether you get a view or a copy depends on the operation
and the pandas version.
So, to ensure that the initial DataFrame remains intact, work on explicit copies and merge them all
at the end.
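A minimal sketch of this practice, using made-up data: the subset is copied explicitly, so modifying it leaves the original DataFrame untouched.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"id": [1, 2, 3], "age": [31, 45, 27]})

# Take an explicit copy of the subset before modifying it
subset = df[df["age"] > 30].copy()
subset["age"] = subset["age"] + 1

print(df["age"].tolist())      # original values
print(subset["age"].tolist())  # modified copy
```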
The loc[ ] is a label-based indexer that is used for selecting data and updating it. You can do so by
passing the row or column label you want to select.
Syntax:
loc[row_label, column_label]
The iloc[ ] is an integer position-based indexer that is used for data selection. Here, you can pass
the row or column positions that you want to select (0-based integer index).
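The difference between the two indexers is easiest to see on a DataFrame whose index labels are not integers. The following sketch uses hypothetical data with string labels "a", "b", and "c".

```python
import pandas as pd

# Hypothetical data with string index labels
df = pd.DataFrame(
    {"age": [31, 45, 27], "gender": ["F", "M", "F"]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "age"]   # row label "b", column label "age"
by_position = df.iloc[1, 0]     # second row, first column

label_slice = df.loc["a":"b"]   # label slices include both endpoints
position_slice = df.iloc[0:1]   # position slices exclude the end

# loc can also be used to update a value in place
df.loc["c", "age"] = 28

print(by_label, by_position)
```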
11. How do you sort a DataFrame?
You can use the sort_values() method to sort a DataFrame by the values in one or more of its
columns. This can be controlled with parameters such as by, axis, and ascending:
#Sort by a single column
df.sort_values(by=['age'])
#Sort by multiple columns
df.sort_values(by=['age', 'id'])
#Sort in descending order
df.sort_values(by='age', ascending=False)
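To see these calls on concrete data, here is a small sketch with invented values. Note that sort_values() returns a new DataFrame and leaves the original order untouched unless inplace=True is passed.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"id": [2, 3, 1], "age": [45, 31, 45]})

# Ties in 'age' are broken by 'id'
by_age_then_id = df.sort_values(by=["age", "id"])

# Largest ages first
by_age_desc = df.sort_values(by="age", ascending=False)

print(by_age_then_id)
print(by_age_desc)
```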
inner (default): the output DataFrame consists of only the records whose key values appear in both
DataFrames:
#Creating two dataframes
})
})
#Inner merge
df_merge
outer: the output DataFrame consists of all records from both the left and the right DataFrame.
Where rows match, their values are combined. Otherwise, the missing values are filled with NaN:
#Outer merge
df_merge
left: output DataFrame consists of all records from the left DataFrame and only the matched
records from the right DataFrame:
#Left join
df_merge
right: output DataFrame consists of all records from the right DataFrame and only the
matched records from the left DataFrame:
#Right join
df_merge
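The four join types described above can be sketched with pd.merge() on two small DataFrames. The data, the DataFrame names, and the choice of "id" as the join key are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical DataFrames sharing an 'id' key; ids 2 and 3 appear in both
left = pd.DataFrame({"id": [1, 2, 3], "age": [31, 45, 27]})
right = pd.DataFrame({"id": [2, 3, 4], "gender": ["M", "F", "M"]})

inner = pd.merge(left, right, on="id", how="inner")       # ids present in both
outer = pd.merge(left, right, on="id", how="outer")       # ids present in either
left_join = pd.merge(left, right, on="id", how="left")    # every id from 'left'
right_join = pd.merge(left, right, on="id", how="right")  # every id from 'right'

print(outer)
```

In the outer, left, and right results, columns coming from the side without a match hold NaN for the unmatched rows.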
df_merge.fillna(0)
On the other hand, the interpolate() method fills missing (NaN) values by interpolating between the
neighboring values, using methods such as linear or time.
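The contrast between filling with a constant and interpolating can be sketched on a short series. The values are hypothetical, and interpolate() is called with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled = s.fillna(0)            # replace every NaN with a constant
interpolated = s.interpolate()  # default method is linear interpolation

print(filled.tolist())
print(interpolated.tolist())
```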
df1 = df.groupby('age')['gender'].apply(list).reset_index(name='new')
df1
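Run against a small DataFrame with invented values, the expression above collects the gender values for each age into a list:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"age": [25, 25, 30], "gender": ["M", "F", "M"]})

# One row per age; the 'new' column holds the list of genders in that group
df1 = df.groupby("age")["gender"].apply(list).reset_index(name="new")

print(df1)
```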
16. How do you delete a column or a row from a Pandas DataFrame?
You can use the drop() method to delete columns or rows from the DataFrame. Its behavior is
controlled by parameters such as labels, axis, index, and columns:
#Dropping a single column
df.drop(['gender'], axis=1)
#Dropping multiple columns
Note: You can also remove duplicate rows by using the drop_duplicates() method. Its subset
parameter restricts the comparison to specific columns.
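The sketch below illustrates dropping columns, dropping a row, and removing duplicate rows. The data is hypothetical and deliberately contains one duplicated row.

```python
import pandas as pd

# Hypothetical sample data; the last two rows are identical
df = pd.DataFrame({
    "id": [1, 2, 2],
    "age": [31, 45, 45],
    "gender": ["F", "M", "M"],
})

no_gender = df.drop(["gender"], axis=1)        # drop a single column
only_id = df.drop(columns=["age", "gender"])   # drop multiple columns
no_first_row = df.drop(index=[0])              # drop a row by its index label
deduped = df.drop_duplicates()                 # keep the first of each duplicated row

print(deduped)
```

Each of these calls returns a new DataFrame; the original df is left unchanged.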
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(df)
18. How can you split your Pandas DataFrame into training and testing sets?
You can use Python’s scikit-learn library to split both NumPy arrays and Pandas DataFrames into
training and testing sets to be used for creating ML models. For this, you can import train_test_split
to perform the split:
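Such a split can be sketched as follows. The DataFrame contents and random_state=42 are assumptions made for illustration; test_size=0.2 requests a 20% testing set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame with 10 rows
df = pd.DataFrame({"age": range(20, 30), "label": [0, 1] * 5})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))
```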
The above code splits the DataFrame into an 80% training set and a 20% testing set.