
Statistics with Python

February 25, 2024

Assignment Webpage: immkavin-ranks.github.io/pd


Author: Kavin Manoharan kavinsde@skiff.com
Personal Website: kavinsde.netlify.app

1 Assignment - Importing Data into Pandas Dataframes


Pandas allows you to import data from a wide range of data sources directly into a
dataframe. These can be static files, such as CSV, TSV, fixed width files, Microsoft
Excel, JSON, SAS and SPSS files, as well as a range of popular databases, such as
MySQL, PostgreSQL and Google BigQuery. You can even scrape data directly from
web pages into Pandas dataframes.
When importing data into Pandas dataframes, you can also save time and write less
code by defining which columns to import, renaming the columns, setting their data
types, defining the index, and more.

1.0.1 Load the Pandas Library


When importing the Pandas package, the convention is to use the command import
pandas as pd, which allows you to call Pandas functions by prefixing them with pd.
instead of pandas.
[1]: import pandas as pd

1.1 Reading CSV with Pandas


Comma Separated Value or CSV files are likely to be the file format you encounter most
commonly in data science. As the name suggests, these are simple text files in which
the values are separated (usually) by commas.
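The "usually" matters: read_csv() also handles other delimiters through its sep argument. The sketch below is only an illustration, assuming a tab-separated file; the in-memory text and the name df_tsv are made up for this example and stand in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory text standing in for a tab-separated file on disk.
tsv_text = "Temperature\tIce Cream Profits\n39\t13.17\n40\t11.88\n"

# sep tells read_csv() which character separates the values (default: ",").
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tsv)
```

For a real file, the io.StringIO(...) wrapper would be replaced by the file path.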

1.1.1 Reading a local CSV file


To import a CSV file and put the contents into a Pandas dataframe, we call the
read_csv() function through the pd alias we defined when we imported Pandas.
The read_csv() function accepts many optional arguments, but at a minimum you
only need to provide the path to the file you wish to read.
This example uses the Ice Cream Sales - temperatures.csv file from the Temperature
and Ice Cream Sales dataset, a public dataset relating temperature and ice cream
revenue.
[2]: df = pd.read_csv('Data/Ice Cream Sales - temperatures.csv')

Reading the data


• One of the most used methods for getting a quick overview of a DataFrame is the head()
method.
• The head() method returns the column headers and a specified number of rows, starting
from the top. The default number of rows is 5.
[3]: df.head()

[3]: Temperature Ice Cream Profits


0 39 13.17
1 40 11.88
2 41 18.82
3 42 18.65
4 43 17.02
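As a small illustration of the head() behaviour described above, the sketch below uses a synthetic ten-row dataframe (df_demo and its value column are made up for this example and are not part of the assignment data). It also shows tail(), the counterpart method that returns rows from the bottom.

```python
import pandas as pd

# Synthetic dataframe with ten rows, used only to demonstrate row selection.
df_demo = pd.DataFrame({"value": range(10)})

default_rows = df_demo.head()   # no argument: the first 5 rows
first_three = df_demo.head(3)   # explicit argument: the first 3 rows
last_two = df_demo.tail(2)      # tail() counts from the bottom instead

print(len(default_rows), len(first_three), len(last_two))
```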

Renaming Column Headers on Import


To rename the columns, we simply use read_csv() to load the file and then pass in a
list of the new names to the names argument, and use skiprows to ignore the first row
of the file which contains the old column names.
[4]: df = pd.read_csv('Data/Ice Cream Sales - temperatures.csv',
                      names=['Temp in Celsius', 'Ice Cream Sale Profits'],
                      skiprows=1)

df.head(2)

[4]: Temp in Celsius Ice Cream Sale Profits


0 39 13.17
1 40 11.88
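An alternative to skiprows=1 is to pass header=0 together with names, which tells read_csv() that row 0 is a header to be replaced. The sketch below compares the two approaches on a small in-memory CSV; the text and the column names in new_names are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the file on disk.
csv_text = "Temperature,Ice Cream Profits\n39,13.17\n40,11.88\n"
new_names = ["temperature", "profit"]  # illustrative replacement names

# Approach from the text: skip the old header row and supply new names.
df_skip = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

# Alternative: mark row 0 as the header and let names replace it.
df_header = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# Both approaches should yield the same dataframe.
print(df_skip.equals(df_header))
```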

1.1.2 Reading a remote CSV file


You can also use read_csv() to read remote CSV files. Instead of passing in the path
to the file you provide the full URL to the CSV file. Here, I’m loading a CSV file from
my GitHub account.

Defining which fields to import


You can use the usecols argument to define the list of column names to import.
[5]: df = pd.read_csv(
         'https://raw.githubusercontent.com/immkavin-ranks/pd/main/Data/spotify_long_tracks_2014_2024.csv',
         usecols=['Name', 'Artists'])
df.head(3)

[5]: Name Artists
0 Steady Rain in a Forest with Light Background … Nature Sounds
1 Soundarya Lahari Mambalam Sisters
2 Waves of Abundance & Fullfillment Zen Life Relax
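The introduction also mentions setting data types and the index at import time. A minimal sketch of how usecols can be combined with the dtype and index_col arguments follows; the in-memory CSV, the track and artist values, and the Duration (min) column are all hypothetical and not taken from the Spotify dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV; the rows and the duration column are made up.
csv_text = (
    "Name,Artists,Duration (min)\n"
    "Track A,Artist X,12.5\n"
    "Track B,Artist Y,30.0\n"
)

df_tracks = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Duration (min)"],   # import only these columns
    dtype={"Duration (min)": "float32"},  # set the column's data type
    index_col="Name",                     # use Name as the dataframe index
)

print(df_tracks.dtypes)
```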

1.2 Reading Microsoft Excel files with Pandas


Pandas includes built-in functionality for reading Microsoft Excel spreadsheet files via
its read_excel() function. This works in the same way as read_csv(), so it can be used
on local Excel documents as well as remote files. However, because Excel files are more
complex than CSV files, reading them can be somewhat slower.

Specifying the number of rows to read


To restrict the number of rows that are read in, you can pass an integer representing the
number of rows to the nrows argument of read_excel(). The read_csv() function accepts
the same argument.

[6]: df = pd.read_excel('Data/team-players.xlsx', nrows=3)


df.head()

[6]: Player Role Price


0 Ambati Rayudu (R) Batsman 2.20 crore
1 Monu Kumar (R) Batsman 20 lakhs
2 Murali Vijay (R) Batsman 2 crore
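Because nrows also exists on read_csv(), its effect can be sketched without an Excel file. The example below uses a synthetic ten-row CSV (the id and value columns are made up) and is only an illustration of the row limit, not a reproduction of the cell above.

```python
import io

import pandas as pd

# Synthetic ten-row CSV built in memory; the columns are illustrative only.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10)) + "\n"

# nrows stops reading after the first 3 data rows (the header is not counted).
df_first = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(df_first)
```

This also explains why head() in the cell above shows only three rows: the dataframe never held more than the three rows that were read.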

1.3 Reading HTML with Pandas


One other handy feature of Pandas is the read_html() function. This allows you to
parse HTML markup from remote web pages or local HTML documents and extract
any tables present.
The read_html() function returns any tables it finds in a list, so if more than one is
present, you’ll need to define which one to display in your dataframe using its list index,
which starts from zero.
[7]: data = pd.read_html('https://en.wikipedia.org/wiki/Comparison_of_programming_languages#General_comparison')

data[1].head()

[7]: Language \
0 1C:Enterprise programming language
1 ActionScript
2 Ada
3 Aldor
4 ALGOL 58

Original purpose Imperative \


0 Application, RAD, business, general, web, mobile Yes
1 Application, client-side, web Yes

2 Application, embedded, realtime, system Yes
3 Highly domain-specific, symbolic computing Yes
4 Application Yes

Object-oriented Functional Procedural Generic Reflective \


0 No Yes Yes Yes Yes
1 Yes Yes Yes No No
2 Yes[2] No Yes[3] Yes[4] No
3 Yes Yes No No No
4 No No No No No

Other paradigms \
0 Object-based, Prototype-based programming
1 prototype-based
2 Concurrent,[5] distributed,[6]
3 NaN
4 NaN

Standardized?
0 No
1 Yes 1999-2003, ActionScript 1.0 with ES3, Acti…
2 Yes 1983, 2005, 2012, ANSI, ISO, GOST 27831-88[7]
3 No
4 No

1.4 Reading SQL Database with Pandas


You can use read_sql() to run a SQL query against a database and load the result into a
dataframe. To connect to a SQLite database file, we import Python's sqlite3 module.

[8]: import sqlite3

conn = sqlite3.connect('Data/favorites.db')

# Read the data from the database into a DataFrame


df = pd.read_sql('SELECT * FROM shows', conn)

# Close the connection to the database


conn.close()

print(df)

id title
0 1 How I Met Your Mother
1 2 The Sopranos
2 3 Friday Night Lights
3 4 Family Guy
4 5 New Girl
.. … …

152 153 Sherlock
153 154 Anne With An E
154 155 Money Heist
155 156 Succession
156 157 Silicon Valley

[157 rows x 2 columns]
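The same pattern can be tried without the favorites.db file. The sketch below assumes an in-memory SQLite database with a hypothetical shows table holding three made-up rows; it also uses the params argument of read_sql() to fill a query placeholder, and contextlib.closing so the connection is closed even if the query fails.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() ensures conn.close() runs even if an error is raised inside.
with closing(sqlite3.connect(":memory:")) as conn:
    # Hypothetical table and rows standing in for Data/favorites.db.
    conn.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO shows (title) VALUES (?)",
        [("The Sopranos",), ("Sherlock",), ("Succession",)],
    )
    conn.commit()

    # params supplies the value for the ? placeholder in the query.
    df_shows = pd.read_sql(
        "SELECT id, title FROM shows WHERE title LIKE ? ORDER BY id",
        conn,
        params=("S%",),
    )

print(df_shows)
```

Passing values through params rather than formatting them into the SQL string keeps the query safe when the filter comes from user input.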

1.5 Requirements
• pandas
• sqlite3 (included in the Python standard library)
• openpyxl (used by read_excel() to read .xlsx files)

1.6 References

https://github.com/immkavin-ranks/pd


1. Datasets from Kaggle
• Temperature and Ice Cream Sales
• Spotify’s Long Hits (2014-2024)
• IPL-2020 Dataset
2. Dataset from Wikipedia - Comparison of Programming Languages
3. Dataset from CS50x 2023 Practice Problem-Favorites
4. An Article from Practical Data Science
5. W3Schools Pandas Tutorial
