
SQL

Q1)
Created the tables in PostgreSQL and answered the following questions.

a. List the last 25% of employees (by salary)

Explanation:

The query orders employees by salary in descending order, then uses OFFSET to skip the first 75% of rows, leaving only the bottom 25% by salary.

Output:
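As a runnable sketch of the approach (the table contents and column names here are invented for illustration; the real schema comes from the assignment's own tables):

```python
import sqlite3

# Hypothetical employees table with made-up salaries, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 80000), ("C", 70000), ("D", 60000),
                  ("E", 50000), ("F", 40000), ("G", 30000), ("H", 20000)])

# Order by salary descending and skip the first 75% of rows, leaving the
# bottom 25% of earners. SQLite needs the offset passed as a parameter
# (and a LIMIT clause; -1 means "no limit"); PostgreSQL also accepts a
# scalar subquery directly in OFFSET.
total = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
bottom_25 = conn.execute(
    "SELECT first_name, salary FROM employees "
    "ORDER BY salary DESC LIMIT -1 OFFSET ?",
    (total * 3 // 4,),
).fetchall()
print(bottom_25)  # → [('G', 30000.0), ('H', 20000.0)]
```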

b. List all the employees who have a salary greater than the average salary of the entire dataset

Explanation:

I used a subquery that calculates the average salary. The subquery executes first, and its result is used as a filter in the main query.

Output:
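A minimal sketch of this pattern, using an invented three-row table (the real table and values come from the assignment data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 60000), ("C", 30000)])

# The scalar subquery computes the overall average once (60000 here);
# the outer query keeps only rows above it.
above_avg = conn.execute(
    "SELECT first_name, salary FROM employees "
    "WHERE salary > (SELECT AVG(salary) FROM employees)"
).fetchall()
print(above_avg)  # → [('A', 90000.0)]
```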

c. Employees with Salaries Higher Than Their Departmental Average

Explanation:
The inner subquery calculates the average salary with a window function partitioned by department, giving each department's average. The main query uses this value as a filter.

Output:
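A sketch of the window-function version, again with an invented two-department table (dept_id and the sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, dept_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("A", 1, 90000), ("B", 1, 50000),
                  ("C", 2, 40000), ("D", 2, 20000)])

# AVG(salary) OVER (PARTITION BY dept_id) attaches each department's
# average to every row; the outer query then filters on it.
query = """
SELECT first_name, dept_id, salary
FROM (
    SELECT *, AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg
    FROM employees
) t
WHERE salary > dept_avg
"""
result = sorted(conn.execute(query).fetchall())
print(result)  # → [('A', 1, 90000.0), ('C', 2, 40000.0)]
```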

d. Find the Duplicate Rows

Explanation:
The query identifies and lists duplicates based on the combination of employee_id, first_name, last_name and dept_id.
The given data had zero duplicates on this combination.
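A sketch of the GROUP BY / HAVING pattern for finding duplicates; here a duplicate row is planted deliberately, since the assignment data had none:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees "
             "(employee_id INTEGER, first_name TEXT, last_name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                 [(1, "A", "X", 1),
                  (1, "A", "X", 1),   # deliberate duplicate for illustration
                  (2, "B", "Y", 2)])

# GROUP BY the full combination of columns and keep only groups that
# occur more than once.
dupes = conn.execute(
    "SELECT employee_id, first_name, last_name, dept_id, COUNT(*) AS n "
    "FROM employees "
    "GROUP BY employee_id, first_name, last_name, dept_id "
    "HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # → [(1, 'A', 'X', 1, 2)]
```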

e. Find the employee with the second highest salary

Explanation:
The query orders rows by salary in descending order, then selects the second highest using LIMIT and OFFSET.
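A sketch of the LIMIT/OFFSET approach (sample rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 90000), ("B", 80000), ("C", 70000)])

# Sort descending, skip the top row, take the next one.
second = conn.execute(
    "SELECT first_name, salary FROM employees "
    "ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()
print(second)  # → ('B', 80000.0)
```

Note that with tied salaries this returns the second *row*, not the second distinct salary; DENSE_RANK() would handle ties more robustly.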

Q2:

Explanation:
The query uses SUM() as a window function ordered by transaction date, which produces a running total.
Output:
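A sketch of the running-total pattern; the table name `transactions` and columns `txn_date`/`amount` are assumptions, since the actual schema is in the missing screenshot:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (txn_date TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [("2023-01-01", 100), ("2023-01-02", 50), ("2023-01-03", 25)])

# SUM() as a window function with ORDER BY accumulates row by row,
# giving a running total over the date ordering.
running = conn.execute(
    "SELECT txn_date, amount, "
    "       SUM(amount) OVER (ORDER BY txn_date) AS running_total "
    "FROM transactions ORDER BY txn_date"
).fetchall()
print(running)
# → [('2023-01-01', 100.0, 100.0), ('2023-01-02', 50.0, 150.0), ('2023-01-03', 25.0, 175.0)]
```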

Q3

Explanation:
Here I created a CTE (common table expression) named LatestEmployment, which selects emp_id, emp_profile and emp_join_date for each employee, keeping the latest emp_join_date using MAX().
I then LEFT JOINed it with employee_table to get the name and the other required columns.
Output:
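A sketch of the CTE-plus-join shape; the history-table name and all sample rows are invented. One caveat: picking the bare `emp_profile` alongside `MAX(emp_join_date)` relies on SQLite's documented bare-column behaviour (the non-aggregated column comes from the max row); in PostgreSQL you would use `DISTINCT ON` or a window function instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employment_history "
             "(emp_id INTEGER, emp_profile TEXT, emp_join_date TEXT)")
conn.execute("CREATE TABLE employee_table (emp_id INTEGER, emp_name TEXT)")
conn.executemany("INSERT INTO employment_history VALUES (?, ?, ?)",
                 [(1, "Analyst", "2020-01-01"),
                  (1, "Senior Analyst", "2022-06-01"),
                  (2, "Engineer", "2021-03-15")])
conn.executemany("INSERT INTO employee_table VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])

# The CTE keeps one row per employee (the latest join date); the outer
# query LEFT JOINs employee_table to attach the name.
query = """
WITH LatestEmployment AS (
    SELECT emp_id, emp_profile, MAX(emp_join_date) AS emp_join_date
    FROM employment_history
    GROUP BY emp_id
)
SELECT le.emp_id, et.emp_name, le.emp_profile, le.emp_join_date
FROM LatestEmployment le
LEFT JOIN employee_table et ON et.emp_id = le.emp_id
ORDER BY le.emp_id
"""
latest = conn.execute(query).fetchall()
print(latest)
# → [(1, 'Asha', 'Senior Analyst', '2022-06-01'), (2, 'Ravi', 'Engineer', '2021-03-15')]
```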

Q4

Here I used a LEFT JOIN because I treated the employee table as the main table and wanted all of its rows, even those with NULLs in the joined columns.
With an INNER JOIN, Nikita would have been left out; with the LEFT JOIN her row is kept, and I can later update the data to add her details.
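The LEFT vs INNER JOIN difference can be shown on a tiny invented pair of tables (the `details` table and its columns are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER, emp_name TEXT)")
conn.execute("CREATE TABLE details (emp_id INTEGER, city TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Amit"), (2, "Nikita")])
conn.execute("INSERT INTO details VALUES (1, 'Pune')")  # no details for Nikita

# LEFT JOIN keeps every employee row; Nikita appears with NULL details.
left = conn.execute(
    "SELECT e.emp_name, d.city FROM employee e "
    "LEFT JOIN details d ON d.emp_id = e.emp_id"
).fetchall()
print(left)   # → [('Amit', 'Pune'), ('Nikita', None)]

# INNER JOIN drops her row entirely.
inner = conn.execute(
    "SELECT e.emp_name, d.city FROM employee e "
    "JOIN details d ON d.emp_id = e.emp_id"
).fetchall()
print(inner)  # → [('Amit', 'Pune')]
```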
PYTHON:
Python P1

● Import the necessary libraries

● Import the first dataset cars1 and cars2.

url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"

cars1 = pd.read_csv(url1)

cars1

url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

cars2=pd.read_csv(url2)

cars2
● Oops, it seems our first dataset has some unnamed blank columns, fix cars1
● What is the number of observations in each dataset?

● Join cars1 and cars2 into a single DataFrame called cars

● Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.
● Add the column owners to cars

Overall code:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import random

import seaborn as sns

##cars1

url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"

cars1 = pd.read_csv(url1)

cars1

##cars2
url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

cars2=pd.read_csv(url2)

cars2

#Removing unnamed columns in cars1

# Remove unnamed blank columns (columns with no header)

cars1 = cars1.loc[:, ~cars1.columns.str.contains('^Unnamed')]

cars1

##Observations

observations_cars1 = len(cars1)

observations_cars2 = len(cars2)

print("Number of observations in cars1:", observations_cars1)

print("Number of observations in cars2:", observations_cars2)

##cars

cars = pd.concat([cars1, cars2], ignore_index=True)  # reset the index so the owners Series aligns row by row

cars

##owner

random_owners = pd.Series([random.randint(15000, 73000) for _ in range(len(cars))])

random_owners

##Adding owners to cars df

cars['owners'] = random_owners

cars

Python P2
● Import the necessary libraries

● Import the dataset from this address.

● Assign it to a variable called online_rt. Note: if you receive a utf-8 decode error, set encoding = 'latin1' in pd.read_csv()

● Create a histogram with the 10 countries that have the most 'Quantity'
ordered except UK

● Exclude negative Quantity entries

check:
Same graph after removing negative quantity.

● Create a scatterplot with the Quantity per UnitPrice by CustomerID for the top 3 Countries (except UK)

Top 3 countries data


● Plot a line chart showing revenue (y) per UnitPrice (x).
Created a column revenue
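The steps above can be sketched on a tiny synthetic frame; the real notebook loads the online retail CSV with `pd.read_csv(..., encoding='latin1')` into `online_rt`, and the rows below are invented stand-ins:

```python
import pandas as pd

# Synthetic stand-in for the online retail data.
online_rt = pd.DataFrame({
    "Country":   ["United Kingdom", "Germany", "France", "Germany", "EIRE", "France"],
    "Quantity":  [500, 40, 30, -5, 20, 10],
    "UnitPrice": [1.0, 2.0, 3.0, 2.0, 4.0, 1.5],
})

# Exclude negative quantities, total Quantity per country, drop UK,
# then keep the 10 largest; .plot(kind='bar') would draw the chart.
top = (online_rt[online_rt["Quantity"] > 0]
       .groupby("Country")["Quantity"].sum()
       .drop("United Kingdom", errors="ignore")
       .nlargest(10))
print(top)

# Revenue per row, as used for the line chart of revenue vs UnitPrice.
online_rt["Revenue"] = online_rt["Quantity"] * online_rt["UnitPrice"]
```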

Python P3
Import these 2 datasets and check the heads. Name the columns of the first dataset as below.
https://media.geeksforgeeks.org/wp-content/uploads/file.tsv (column names - 'user_id', 'item_id', 'rating', 'timestamp')
'https://media.geeksforgeeks.org/wp-content/uploads/Movie_Id_Titles.csv'
● Calculate mean rating of all movies

● Calculate count rating of all movies


● Create a dataframe with the 'rating' count values and the mean rating of each movie. This gives the movie-wise count of ratings and mean rating.

● Plot a graph of the 'num of ratings' column. This is a histogram chart using the above ratings dataframe.
● Plot a graph of the 'rating' column. A histogram of the average-rating column.
● Analyse the correlation of Star Wars with other movies [correlation of ratings]

Created a grouped df based on the sum and count of ratings

Creating separate dfs for Star Wars and the other movies


Calculating correlation:

On analysing, we get ~0 correlation, as most of the data points have very little variance and are almost identical in value.

● Analyse the correlation of 12 Angry Men with other movies [correlation of ratings]

On analysing, we again get ~0 correlation, as most of the data points have very little variance and are almost identical in value.
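The movie-wise aggregation and the correlation step can be sketched as below; the nine synthetic ratings and the helper frame are invented, while the real notebook builds the data by merging file.tsv with Movie_Id_Titles.csv:

```python
import pandas as pd

# Synthetic stand-in for the merged ratings data.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "title":   ["Star Wars (1977)", "Movie A", "Movie B"] * 3,
    "rating":  [5, 4, 2, 4, 3, 1, 5, 5, 2],
})

# Movie-wise mean rating and number of ratings.
ratings = df.groupby("title")["rating"].agg(["mean", "count"])
ratings.columns = ["rating", "num of ratings"]
print(ratings)  # each column could be plotted with .hist()

# User x movie matrix; corrwith gives each movie's rating correlation
# with Star Wars across users.
movie_matrix = df.pivot_table(index="user_id", columns="title", values="rating")
starwars_corr = movie_matrix.corrwith(movie_matrix["Star Wars (1977)"])
print(starwars_corr)
```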

Notebook link for reference :


https://colab.research.google.com/drive/1Ldsy2g9wvavUaUTP5gUb7QgLyy_DZaag?usp=sharing
