
Data Cleaning By Raju Gajelli

November 30, 2023

[22]: # Importing necessary libraries:

import numpy as np
import pandas as pd
import re
import warnings
warnings.filterwarnings('ignore')

[23]: df = pd.read_csv("Case Study - Data Cleansing_Python (1).xlsx - Raw Data.csv")


[24]: df

[24]: CompanyName WebAddress \


0 Evidera http://www.evidera.com
1 Microban International http://www.microban.com
2 Millers Print http://www.millerslab.com
3 Schumacher & Farley http://www.schumacherandfarley.com
4 HarperCollins Publishers http://www.harpercollins.com.au
.. … …
95 bp http://www.bp.com
96 SNCF http://www.sncf.com
97 NicholsBooth Architects http://www.nicholsbooth.com
98 FreeWheel http://www.freewheel.com
99 Acro Associates, Inc. http://www.acroassociates.com

ContactName Title Country \


0 Taras Kolcio Chief Financial Officer USA
1 Duane Centola RESEARCH USA
2 Jolon Riney Web Software Developer USA
3 David Schmieder Owner USA
4 Sukumar Manager Australia
.. … … …
95 Nicki Robson '-- UK
96 M?Lanie Wittmann D?partement Marketing France
97 Alodie Girmann Designer USA
98 Chamil Dezoysa Senior Software Engineer USA
99 Rob Robertson Graphic Designer USA

Email \
0 taras.kolcio@evidera.com
1 duane.centola@microban.com
2 jolonr@millerslab.com
3 d.schmieder@schumacherandfarley.com
4 sukumar@harpercollins-india.com
.. …
95 nicki.robson@bp.com
96 melanie.wittmann@sncf.com
97 girmanna@revelers.com
98 cdezoysa@freewheel.tv
99 rrobertson@norgren.com

Linkedin
0 http://www.linkedin.com/in/taras-kolcio-73ab0921
1 http://www.linkedin.com/in/duane-centola-86831712
2 NaN
3 NaN
4 NaN
.. …
95 http://www.linkedin.com/in/nicki-robson-2b9aa989
96 NaN
97 NaN
98 http://www.linkedin.com/in/chamildz
99 NaN

[100 rows x 7 columns]

[25]: # Checking the column names

df.columns

[25]: Index(['CompanyName', 'WebAddress', 'ContactName', 'Title', 'Country', 'Email',
       'Linkedin'],
      dtype='object')


[26]: # Checking column data types, memory usage, and null counts

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CompanyName 100 non-null object
1 WebAddress 100 non-null object
2 ContactName 100 non-null object
3 Title 100 non-null object
4 Country 100 non-null object
5 Email 100 non-null object
6 Linkedin 47 non-null object
dtypes: object(7)
memory usage: 5.6+ KB


1 Renaming Columns
[27]: df.rename(columns={"CompanyName": "Company", "WebAddress": "Website",
                         "ContactName": "Contact Name"}, inplace=True)
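As a minimal sketch of the same renaming on a toy frame (the two sample rows are taken from the output above; `raw` and `renamed` are hypothetical names), `rename` can also return a new DataFrame instead of modifying the data in place, which keeps the original column names available for comparison:

```python
import pandas as pd

# Toy frame built from two rows of the data shown above.
raw = pd.DataFrame({
    "CompanyName": ["Evidera", "bp"],
    "WebAddress": ["http://www.evidera.com", "http://www.bp.com"],
    "ContactName": ["Taras Kolcio", "Nicki Robson"],
})

# Same mapping as the cell above, but without inplace=True:
# `raw` is left untouched and a renamed copy is returned.
renamed = raw.rename(columns={
    "CompanyName": "Company",
    "WebAddress": "Website",
    "ContactName": "Contact Name",
})

print(list(renamed.columns))
```

Columns that are not listed in the mapping keep their existing names.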

[28]: df

[28]: Company Website \


0 Evidera http://www.evidera.com
1 Microban International http://www.microban.com
2 Millers Print http://www.millerslab.com
3 Schumacher & Farley http://www.schumacherandfarley.com
4 HarperCollins Publishers http://www.harpercollins.com.au
.. … …
95 bp http://www.bp.com
96 SNCF http://www.sncf.com
97 NicholsBooth Architects http://www.nicholsbooth.com
98 FreeWheel http://www.freewheel.com
99 Acro Associates, Inc. http://www.acroassociates.com

Contact Name Title Country \


0 Taras Kolcio Chief Financial Officer USA
1 Duane Centola RESEARCH USA
2 Jolon Riney Web Software Developer USA
3 David Schmieder Owner USA
4 Sukumar Manager Australia
.. … … …
95 Nicki Robson '-- UK
96 M?Lanie Wittmann D?partement Marketing France
97 Alodie Girmann Designer USA
98 Chamil Dezoysa Senior Software Engineer USA
99 Rob Robertson Graphic Designer USA

Email \
0 taras.kolcio@evidera.com
1 duane.centola@microban.com
2 jolonr@millerslab.com
3 d.schmieder@schumacherandfarley.com
4 sukumar@harpercollins-india.com
.. …
95 nicki.robson@bp.com
96 melanie.wittmann@sncf.com
97 girmanna@revelers.com
98 cdezoysa@freewheel.tv
99 rrobertson@norgren.com

Linkedin
0 http://www.linkedin.com/in/taras-kolcio-73ab0921
1 http://www.linkedin.com/in/duane-centola-86831712
2 NaN
3 NaN
4 NaN
.. …
95 http://www.linkedin.com/in/nicki-robson-2b9aa989
96 NaN
97 NaN
98 http://www.linkedin.com/in/chamildz
99 NaN

[100 rows x 7 columns]

2 Removing Spaces
[29]: df['Company'] = df['Company'].str.strip()
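Only the Company column is stripped here. If the other text columns may also carry stray whitespace, the same call can be applied to each of them. The sketch below assumes every column holds strings, as the `object` dtypes reported by `df.info()` suggest; the padded sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical values with leading/trailing spaces.
toy = pd.DataFrame({
    "Company": ["  Evidera ", "bp"],
    "Country": ["USA ", " UK"],
})

# Strip leading and trailing whitespace in every column.
# Assumes all columns contain strings.
for col in toy.columns:
    toy[col] = toy[col].str.strip()

print(toy)
```

Note that `str.strip()` only touches the ends of each value; spaces inside a name such as "Microban International" are preserved.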

3 Replacing ‘?’ with an empty string in the ‘Contact Name’ column
[30]: df['Contact Name'] = df['Contact Name'].str.replace('?', '', regex=False)

# regex=False treats '?' as a literal character, so this removes every '?' in the 'Contact Name' column

4 Replacing ‘?’ with an empty string in the ‘Title’ column

[31]: df['Title'] = df['Title'].str.replace('?', '', regex=False)

# regex=False treats '?' as a literal character, so this removes every '?' in the 'Title' column
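The '?' in values such as "M?Lanie Wittmann" and "D?partement Marketing" probably stands for a character that was lost to an encoding problem (the email `melanie.wittmann@sncf.com` on the same row hints at an accented "é", though that is an assumption). Deleting the '?' therefore leaves "MLanie" and "Dpartement". A minimal sketch of one cautious variant: record which rows are affected before removing the character, so they can be reviewed by hand later. The names `has_qmark` and `flagged` are hypothetical.

```python
import pandas as pd

# Two rows taken from the output above.
toy = pd.DataFrame({
    "Contact Name": ["Taras Kolcio", "M?Lanie Wittmann"],
    "Title": ["Chief Financial Officer", "D?partement Marketing"],
})

# Flag rows where either column contains a literal '?'.
# regex=False is needed because '?' is a regex metacharacter.
has_qmark = (
    toy["Contact Name"].str.contains("?", regex=False)
    | toy["Title"].str.contains("?", regex=False)
)
flagged = toy[has_qmark].copy()  # untouched copy kept for manual review

# Remove the literal '?' from both columns, as in the cells above.
for col in ["Contact Name", "Title"]:
    toy[col] = toy[col].str.replace("?", "", regex=False)

print(flagged)
print(toy)
```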


5 Checking for Null values
[32]: df.isnull().sum()

[32]: Company 0
Website 0
Contact Name 0
Title 0
Country 0
Email 0
Linkedin 53
dtype: int64


[33]: df["Linkedin"]=df["Linkedin"].fillna("")
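Before filling, it can help to see what share of each column is missing, not just the raw count. The following is a small sketch on made-up data (the email addresses and the `summary` name are hypothetical) that reports both and then applies the same `fillna("")` step:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one LinkedIn URL present, three missing.
toy = pd.DataFrame({
    "Email": ["a@x.com", "b@y.com", "c@z.com", "d@w.com"],
    "Linkedin": ["http://www.linkedin.com/in/a", np.nan, np.nan, np.nan],
})

# Count and fraction of missing values per column.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
})
print(summary)

# Fill missing LinkedIn values with an empty string, as in the cell above.
toy["Linkedin"] = toy["Linkedin"].fillna("")
```

An empty string is no longer counted by `isnull()`, so later null checks will report zero for the column even though the URLs are still absent.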

[34]: df

[34]: Company Website \


0 Evidera http://www.evidera.com
1 Microban International http://www.microban.com
2 Millers Print http://www.millerslab.com
3 Schumacher & Farley http://www.schumacherandfarley.com
4 HarperCollins Publishers http://www.harpercollins.com.au
.. … …
95 bp http://www.bp.com
96 SNCF http://www.sncf.com
97 NicholsBooth Architects http://www.nicholsbooth.com
98 FreeWheel http://www.freewheel.com
99 Acro Associates, Inc. http://www.acroassociates.com

Contact Name Title Country \


0 Taras Kolcio Chief Financial Officer USA
1 Duane Centola RESEARCH USA
2 Jolon Riney Web Software Developer USA
3 David Schmieder Owner USA
4 Sukumar Manager Australia
.. … … …
95 Nicki Robson '-- UK
96 MLanie Wittmann Dpartement Marketing France
97 Alodie Girmann Designer USA
98 Chamil Dezoysa Senior Software Engineer USA
99 Rob Robertson Graphic Designer USA

Email \
0 taras.kolcio@evidera.com
1 duane.centola@microban.com
2 jolonr@millerslab.com
3 d.schmieder@schumacherandfarley.com
4 sukumar@harpercollins-india.com
.. …
95 nicki.robson@bp.com
96 melanie.wittmann@sncf.com
97 girmanna@revelers.com
98 cdezoysa@freewheel.tv
99 rrobertson@norgren.com

Linkedin
0 http://www.linkedin.com/in/taras-kolcio-73ab0921
1 http://www.linkedin.com/in/duane-centola-86831712
2
3
4
.. …
95 http://www.linkedin.com/in/nicki-robson-2b9aa989
96
97
98 http://www.linkedin.com/in/chamildz
99

[100 rows x 7 columns]

[35]: df.isnull().sum()

[35]: Company 0
Website 0
Contact Name 0
Title 0
Country 0
Email 0
Linkedin 0
dtype: int64


6 Ensuring Website values start with ‘http://’ or ‘https://’

[36]: df['Website'] = df['Website'].apply(
          lambda x: x if x.startswith(('http://', 'https://')) else 'http://' + x)
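A check on the bare prefix 'http' would also accept a value like 'httpbin.org', which has no scheme at all. The sketch below uses a hypothetical helper, `ensure_scheme`, that tests for the two full schemes explicitly; the sample values other than the ones from the data above are made up.

```python
import pandas as pd

def ensure_scheme(url: str) -> str:
    """Prefix 'http://' unless the URL already starts with http:// or https://."""
    url = url.strip()
    # str.startswith accepts a tuple of alternatives.
    if url.lower().startswith(("http://", "https://")):
        return url
    return "http://" + url

toy = pd.Series([
    "www.bp.com",                  # no scheme: gets the prefix
    "http://www.sncf.com",         # already fine
    "https://www.freewheel.com",   # already fine
    "httpbin.org",                 # starts with 'http' but has no scheme
])
cleaned = toy.apply(ensure_scheme)
print(cleaned.tolist())
```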

6.1 Ensuring emails are in lowercase for consistency
[37]: df['Email'] = df['Email'].str.lower()
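Lowercasing can be combined with trimming and a rough shape check to spot entries that are clearly not email addresses. This is only a sketch: the pattern is a deliberately loose assumption (something@something.tld), not a full validation, and the third sample value is made up.

```python
import pandas as pd

emails = pd.Series([
    "Taras.Kolcio@Evidera.com ",   # mixed case, trailing space
    "nicki.robson@bp.com",
    "not-an-email",                # hypothetical bad entry
])

# Trim whitespace, then lowercase for consistency.
normalized = emails.str.strip().str.lower()

# Loose shape check: text, '@', text, '.', text, with no spaces or extra '@'.
looks_valid = normalized.str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

print(normalized[~looks_valid].tolist())
```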

7 Checking and cleaning LinkedIn URLs


[38]: df['Linkedin'] = df['Linkedin'].apply(lambda x: x if str(x).startswith('http') else None)
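A vectorized alternative to the `apply` call is `Series.where`, which keeps values where a condition holds and replaces the rest with a missing value. The sketch below assumes the column already contains only strings (as it does after the earlier `fillna("")` step); the "n/a" entry is a made-up example of a non-URL value.

```python
import pandas as pd

linkedin = pd.Series([
    "http://www.linkedin.com/in/chamildz",
    "",       # previously missing, filled with an empty string
    "n/a",    # hypothetical placeholder text
])

# Keep values that start with 'http'; everything else becomes missing.
cleaned = linkedin.where(linkedin.str.startswith("http"))

print(cleaned)
```

One difference from the cell above: the rejected entries show up as `NaN` rather than `None`. Both are treated as missing by `isnull()`.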

[39]: df

[39]: Company Website \


0 Evidera http://www.evidera.com
1 Microban International http://www.microban.com
2 Millers Print http://www.millerslab.com
3 Schumacher & Farley http://www.schumacherandfarley.com
4 HarperCollins Publishers http://www.harpercollins.com.au
.. … …
95 bp http://www.bp.com
96 SNCF http://www.sncf.com
97 NicholsBooth Architects http://www.nicholsbooth.com
98 FreeWheel http://www.freewheel.com
99 Acro Associates, Inc. http://www.acroassociates.com

Contact Name Title Country \


0 Taras Kolcio Chief Financial Officer USA
1 Duane Centola RESEARCH USA
2 Jolon Riney Web Software Developer USA
3 David Schmieder Owner USA
4 Sukumar Manager Australia
.. … … …
95 Nicki Robson '-- UK
96 MLanie Wittmann Dpartement Marketing France
97 Alodie Girmann Designer USA
98 Chamil Dezoysa Senior Software Engineer USA
99 Rob Robertson Graphic Designer USA

Email \
0 taras.kolcio@evidera.com
1 duane.centola@microban.com
2 jolonr@millerslab.com
3 d.schmieder@schumacherandfarley.com
4 sukumar@harpercollins-india.com
.. …
95 nicki.robson@bp.com
96 melanie.wittmann@sncf.com
97 girmanna@revelers.com
98 cdezoysa@freewheel.tv
99 rrobertson@norgren.com

Linkedin
0 http://www.linkedin.com/in/taras-kolcio-73ab0921
1 http://www.linkedin.com/in/duane-centola-86831712
2 None
3 None
4 None
.. …
95 http://www.linkedin.com/in/nicki-robson-2b9aa989
96 None
97 None
98 http://www.linkedin.com/in/chamildz
99 None

[100 rows x 7 columns]

