
SQL

Q1)
Created the tables in PostgreSQL and answered the following questions.

a. List the last 25% of employees (by salary)

Explanation:

The query orders employees by salary in descending order, then uses OFFSET to skip the first 75% of rows, leaving only the bottom 25% by salary.

Output:
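As a runnable sketch of the approach (the table contents and column names here are invented for illustration; the real schema comes from the assignment's own tables):

```python
import sqlite3

# Hypothetical employees table with made-up salaries, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 80000), ("C", 70000), ("D", 60000),
                  ("E", 50000), ("F", 40000), ("G", 30000), ("H", 20000)])

# Order by salary descending and skip the first 75% of rows, leaving the
# bottom 25% of earners. SQLite needs the offset passed as a parameter
# (and a LIMIT clause; -1 means "no limit"); PostgreSQL also accepts a
# scalar subquery directly in OFFSET.
total = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
bottom_25 = conn.execute(
    "SELECT first_name, salary FROM employees "
    "ORDER BY salary DESC LIMIT -1 OFFSET ?",
    (total * 3 // 4,),
).fetchall()
print(bottom_25)  # → [('G', 30000.0), ('H', 20000.0)]
```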

b. List all the employees who have a salary greater than the average salary of the entire dataset

Explanation:

I used a subquery that calculates the average salary. The subquery executes first, and its result is used as a filter in the main query.

Output:
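A minimal sketch of this pattern, using an invented three-row table (the real table and values come from the assignment data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 60000), ("C", 30000)])

# The scalar subquery computes the overall average once (60000 here);
# the outer query keeps only rows above it.
above_avg = conn.execute(
    "SELECT first_name, salary FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()
print(above_avg)  # → [('A', 90000.0)]
```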

c. Employees with Salaries Higher Than Their Departmental Average

Explanation:
The inner subquery calculates the average salary with a window function partitioned by department, giving each department's average. The main query uses this value as a filter.

Output:
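A sketch of the window-function version, again with an invented two-department table (dept_id and the sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, dept_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("A", 1, 90000), ("B", 1, 50000),
                  ("C", 2, 40000), ("D", 2, 20000)])

# AVG(salary) OVER (PARTITION BY dept_id) attaches each department's
# average to every row; the outer query then filters on it.
query = """
SELECT first_name, dept_id, salary
FROM (
    SELECT *, AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg
    FROM employees
) t
WHERE salary > dept_avg
"""
result = sorted(conn.execute(query).fetchall())
print(result)  # → [('A', 1, 90000.0), ('C', 2, 40000.0)]
```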

d. Find the Duplicate Rows

Explanation:
The query identifies and lists duplicates based on the combination of employee_id, first_name, last_name and dept_id.
The given data had zero duplicates on this combination.
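A sketch of the GROUP BY / HAVING pattern for finding duplicates; here a duplicate row is planted deliberately, since the assignment data had none:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees "
             "(employee_id INTEGER, first_name TEXT, last_name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "A", "X", 1),
                  (1, "A", "X", 1),   # deliberate duplicate for illustration
                  (2, "B", "Y", 2)])

# GROUP BY the full combination of columns and keep only groups that
# occur more than once.
dupes = conn.execute(
    "SELECT employee_id, first_name, last_name, dept_id, COUNT(*) AS n "
    "FROM employees "
    "GROUP BY employee_id, first_name, last_name, dept_id "
    "HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # → [(1, 'A', 'X', 1, 2)]
```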

e. Find the employee with the second highest salary

Explanation:
The query orders rows by salary in descending order, then selects the second highest using LIMIT and OFFSET.
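A sketch of the LIMIT/OFFSET approach (sample rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 80000), ("C", 70000)])

# Sort descending, skip the top row, take the next one.
second = conn.execute(
    "SELECT first_name, salary FROM employees "
    "ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()
print(second)  # → ('B', 80000.0)
```

Note that with tied salaries this returns the second *row*, not the second distinct salary; DENSE_RANK() would handle ties more robustly.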

Q2:

Explanation:
The query uses SUM() as a window function ordered by transaction date, which produces a running total.
Output:
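A sketch of the running-total pattern; the table name `transactions` and columns `txn_date`/`amount` are assumptions, since the actual schema is in the missing screenshot:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (txn_date TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("2023-01-01", 100), ("2023-01-02", 50), ("2023-01-03", 25)])

# SUM() as a window function with ORDER BY accumulates row by row,
# giving a running total over the date ordering.
running = conn.execute(
    "SELECT txn_date, amount, "
    "       SUM(amount) OVER (ORDER BY txn_date) AS running_total "
    "FROM transactions ORDER BY txn_date"
).fetchall()
print(running)
# → [('2023-01-01', 100.0, 100.0), ('2023-01-02', 50.0, 150.0), ('2023-01-03', 25.0, 175.0)]
```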

Q3

Explanation:
Here I created a CTE (common table expression) named LatestEmployment, which selects emp_id, emp_profile and emp_join_date for each employee, keeping the latest emp_join_date using MAX().
I then LEFT JOINed it with employee_table to get the name and the other required columns.
Output:
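A sketch of the CTE-plus-join shape; the history-table name and all sample rows are invented. One caveat: picking the bare `emp_profile` alongside `MAX(emp_join_date)` relies on SQLite's documented bare-column behaviour (the non-aggregated column comes from the max row); in PostgreSQL you would use `DISTINCT ON` or a window function instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employment_history "
             "(emp_id INTEGER, emp_profile TEXT, emp_join_date TEXT)")
conn.execute("CREATE TABLE employee_table (emp_id INTEGER, emp_name TEXT)")
conn.executemany("INSERT INTO employment_history VALUES (?, ?, ?)",
                 [(1, "Analyst", "2020-01-01"),
                  (1, "Senior Analyst", "2022-06-01"),
                  (2, "Engineer", "2021-03-15")])
conn.executemany("INSERT INTO employee_table VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])

# The CTE keeps one row per employee (the latest join date); the outer
# query LEFT JOINs employee_table to attach the name.
query = """
WITH LatestEmployment AS (
    SELECT emp_id, emp_profile, MAX(emp_join_date) AS emp_join_date
    FROM employment_history
    GROUP BY emp_id
)
SELECT le.emp_id, et.emp_name, le.emp_profile, le.emp_join_date
FROM LatestEmployment le
LEFT JOIN employee_table et ON et.emp_id = le.emp_id
ORDER BY le.emp_id
"""
latest = conn.execute(query).fetchall()
print(latest)
# → [(1, 'Asha', 'Senior Analyst', '2022-06-01'), (2, 'Ravi', 'Engineer', '2021-03-15')]
```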

Q4

Here I used a LEFT JOIN because I treated the employee table as the main table and wanted all of its rows, even those with NULLs in the joined columns.
With an INNER JOIN, Nikita would have been left out; with the LEFT JOIN her row is kept, and I can later update the data to add her details.
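The LEFT vs INNER JOIN difference can be shown on a tiny invented pair of tables (the `details` table and its columns are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, emp_name TEXT)")
conn.execute("CREATE TABLE details (emp_id INTEGER, city TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Amit"), (2, "Nikita")])
conn.execute("INSERT INTO details VALUES (1, 'Pune')")  # no details for Nikita

# LEFT JOIN keeps every employee row; Nikita appears with NULL details.
left = conn.execute(
    "SELECT e.emp_name, d.city FROM employee e "
    "LEFT JOIN details d ON d.emp_id = e.emp_id"
).fetchall()
print(left)   # → [('Amit', 'Pune'), ('Nikita', None)]

# INNER JOIN drops her row entirely.
inner = conn.execute(
    "SELECT e.emp_name, d.city FROM employee e "
    "JOIN details d ON d.emp_id = e.emp_id"
).fetchall()
print(inner)  # → [('Amit', 'Pune')]
```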
PYTHON:
Python P1

● Import the necessary libraries

● Import the first dataset cars1 and cars2.

url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"

cars1 = pd.read_csv(url1)

cars1

url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

cars2=pd.read_csv(url2)

cars2
● Oops, it seems our first dataset has some unnamed blank columns, fix cars1
● What is the number of observations in each dataset?

● Join cars1 and cars2 into a single DataFrame called cars

● Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.
● Add the column owners to cars

Overall code:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import random

import seaborn as sns

##cars1

url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"

cars1 = pd.read_csv(url1)

cars1

##cars2
url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

cars2=pd.read_csv(url2)

cars2

#Removing unnamed columns in cars1

# Remove unnamed blank columns (columns with no header)

cars1 = cars1.loc[:, ~cars1.columns.str.contains('^Unnamed')]

cars1

##Observations

observations_cars1 = len(cars1)

observations_cars2 = len(cars2)

print("Number of observations in cars1:", observations_cars1)

print("Number of observations in cars2:", observations_cars2)

##cars

cars = pd.concat([cars1, cars2], ignore_index=True)  # reset the index so the owners Series aligns row by row

cars

##owner

random_owners = pd.Series([random.randint(15000, 73000) for _ in range(len(cars))])

random_owners

##Adding owners to cars df

cars['owners'] = random_owners

cars

Python P2
● Import the necessary libraries

● Import the dataset from this address.

● Assign it to a variable called online_rt. Note: if you receive a utf-8 decode error, set encoding = 'latin1' in pd.read_csv()

● Create a histogram with the 10 countries that have the most 'Quantity'
ordered except UK

● Exclude negative Quantity entries

check:
Same graph after removing negative quantity.

● Create a scatterplot with the Quantity per UnitPrice by CustomerID for the top 3 Countries (except UK)

Top 3 countries data


● Plot a line chart showing revenue (y) per UnitPrice (x).
Created a column revenue
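The steps above can be sketched on a tiny synthetic frame; the real notebook loads the online retail CSV with `pd.read_csv(..., encoding='latin1')` into `online_rt`, and the rows below are invented stand-ins:

```python
import pandas as pd

# Synthetic stand-in for the online retail data.
online_rt = pd.DataFrame({
    "Country":   ["United Kingdom", "Germany", "France", "Germany", "EIRE", "France"],
    "Quantity":  [500, 40, 30, -5, 20, 10],
    "UnitPrice": [1.0, 2.0, 3.0, 2.0, 4.0, 1.5],
})

# Exclude negative quantities, total Quantity per country, drop UK,
# then keep the 10 largest; .plot(kind='bar') would draw the chart.
top = (online_rt[online_rt["Quantity"] > 0]
       .groupby("Country")["Quantity"].sum()
       .drop("United Kingdom", errors="ignore")
       .nlargest(10))
print(top)

# Revenue per row, as used for the line chart of revenue vs UnitPrice.
online_rt["Revenue"] = online_rt["Quantity"] * online_rt["UnitPrice"]
```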

Python P3
Import these 2 datasets and check the heads. Name the columns of the first dataset as below.
https://media.geeksforgeeks.org/wp-content/uploads/file.tsv (column names - 'user_id', 'item_id', 'rating', 'timestamp')
'https://media.geeksforgeeks.org/wp-content/uploads/Movie_Id_Titles.csv'
● Calculate mean rating of all movies

● Calculate count rating of all movies


● Create a dataframe with the 'rating' count values and the mean rating of each movie. This gives the movie-wise count of ratings and mean rating.

● Plot a graph of the 'num of ratings' column. This is a histogram chart using the above ratings dataframe.
● Plot a graph of the 'rating' column. A histogram of the average-rating column.
● Analyse the correlation of Star Wars with other movies [correlation of ratings]

Created a grouped df based on the sum and count of ratings

Creating separate dfs for Star Wars and the other movies


Calculating correlation:

On analysing, we get ~0 correlation, as most of the data points have very little variance and are almost identical in value.

● Analyse the correlation of 12 Angry Men with other movies [correlation of ratings]

On analysing, we again get ~0 correlation, as most of the data points have very little variance and are almost identical in value.
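The movie-wise aggregation and the correlation step can be sketched as below; the nine synthetic ratings and the helper frame are invented, while the real notebook builds the data by merging file.tsv with Movie_Id_Titles.csv:

```python
import pandas as pd

# Synthetic stand-in for the merged ratings data.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "title":   ["Star Wars (1977)", "Movie A", "Movie B"] * 3,
    "rating":  [5, 4, 2, 4, 3, 1, 5, 5, 2],
})

# Movie-wise mean rating and number of ratings.
ratings = df.groupby("title")["rating"].agg(["mean", "count"])
ratings.columns = ["rating", "num of ratings"]
print(ratings)  # each column could be plotted with .hist()

# User x movie matrix; corrwith gives each movie's rating correlation
# with Star Wars across users.
movie_matrix = df.pivot_table(index="user_id", columns="title", values="rating")
starwars_corr = movie_matrix.corrwith(movie_matrix["Star Wars (1977)"])
print(starwars_corr)
```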

Notebook link for reference :


https://colab.research.google.com/drive/1Ldsy2g9wvavUaUTP5gUb7QgLyy_DZaag?usp=sharing
