MAJOR Mid Sem Report
SYNOPSIS
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
ON
Submitted by:
Aashish Kumar(BFSI)
Aryan Tayal(ECRA)
Vishwatej Ajay(ECRA)
I hereby declare that this submission is my own and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person nor material which has been accepted for the award of any other
Degree or Diploma of the University or other Institute of Higher Learning, except
where due acknowledgement has been made in the text.
Date
ACKNOWLEDGEMENT
I would like to express my deepest appreciation to all those who provided me the
possibility to complete this report. I am very thankful to UPES for giving me the
opportunity to undertake the summer internship. I wish to express my sincere
gratitude to Dr. Shweta Mongia for providing me an opportunity to do my project
work on Loan Prediction. This project bears an imprint of many people. I sincerely
thank my project guide Dr. Shweta Mongia for guidance and encouragement for
carrying out this project work.
I also appreciate the guidance of the other supervisors and the panel members,
whose comments and advice during the project presentation improved our
presentation skills.
1.3 WHAT IS PREDICTIVE MODELING?
Predictive modeling is a process that uses data and statistics to predict outcomes
with data models. These models can be used to predict anything from sports
outcomes and TV ratings to technological advances and corporate earnings.
The terms predictive modeling and predictive analytics are often used
interchangeably. However, predictive analytics
most often refers to commercial applications of predictive modeling, while
predictive modeling is used more generally or academically. Of the terms,
predictive modeling is used more frequently, which is illustrated in the Google
Trends chart below. Machine learning is also distinct from predictive modeling and
is defined as the use of statistical techniques to allow a computer to construct
predictive models. In practice, machine learning and predictive modeling are often
used interchangeably. However, machine learning is a branch of artificial
intelligence, which refers to intelligence displayed by machines.
1.4 OVERVIEW:
Predictive modeling is useful because it gives data-driven insight into a question
and allows users to create forecasts. To maintain a competitive advantage, it is critical
to have insight into future events and outcomes that challenge key assumptions.
Analytics leaders must align predictive modeling initiatives with an organization’s
strategic goals. For example, a computer chip manufacturer might set a strategic
priority to produce chips with the greatest number of transistors in the industry by
2025. Analytics professionals could feed a predictive model product, geography,
sales, and other related trend data to forecast the number of transistors per chip
needed to become the industry leader. Additional sources could include
data about the most transistor-dense chips, commercial demand for computing
power, and strategic partnerships between chip manufacturers and hardware
manufacturers. Once initiatives are in motion, analytics professionals can perform
backward-looking analyses to assess the accuracy of predictive models and the
success of the initiatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Importing Dataset:
train=pd.read_csv('Train.csv')
train.head(8)
O/P:
Taking a look at the features of the dataset:
train.info()
O/P:
Taking a look at the number of null values:
train.isnull().sum()
O/P:
For Outlet_Size:
O/P:
Importing test dataset:
test=pd.read_csv('Test.csv')
Imputing missing Outlet_Size values with the most frequent category:
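from sklearn.impute import SimpleImputer
# Fill missing Outlet_Size values with the most frequent category
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer = imputer.fit(data[['Outlet_Size']])
# transform returns a 2-D array, so flatten it back into a single column
data['Outlet_Size'] = imputer.transform(data[['Outlet_Size']]).ravel()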
Earlier we saw that the Item_Type variable has 16 categories which might not
prove to be very useful in our analysis.
So it’s a good idea to combine them. If we look closely at the Item_Identifier of
each item, we see that each one starts with either “FD” (Food), “DR” (Drinks) or
“NC” (Non-Consumables).
Getting the first two characters of ID:
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
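The two-character codes can then be replaced with the readable names listed above. The following is a minimal sketch, assuming a small made-up frame in place of the real dataset; the sample identifiers and the category_names dictionary are illustrative and not taken from the report.

```python
import pandas as pd

# Toy frame standing in for the combined dataset; the identifiers are made up.
data = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01', 'NCD19', 'FDX07']})

# The first two characters of the identifier give the broad category code.
data['Item_Type_Combined'] = data['Item_Identifier'].str[:2]

# Replace the codes with the readable names given in the text.
category_names = {'FD': 'Food', 'DR': 'Drinks', 'NC': 'Non-Consumable'}
data['Item_Type_Combined'] = data['Item_Type_Combined'].map(category_names)

print(data['Item_Type_Combined'].value_counts())
```

Any identifier whose prefix is missing from the dictionary would become NaN after map, so it is worth checking for nulls afterwards.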
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'})
print(data['Item_Fat_Content'].value_counts())
O/P:
FEATURE ENGINEERING:
Since scikit-learn estimators expect numerical input, we need to convert all
categories of nominal variables into numeric types. Let’s start by turning all
categorical variables into numerical values using LabelEncoder(), which encodes
labels with values between 0 and n_classes-1.
After that, we can use get_dummies to generate dummy variables from these
numerically encoded categorical variables.
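The two steps described above could look roughly like this. It is a sketch on a toy frame, not the report's actual code: the column names follow the report, while the values and the categorical_cols list are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; column names follow the report, the values are illustrative.
data = pd.DataFrame({
    'Item_Fat_Content': ['Low Fat', 'Regular', 'Low Fat', 'Regular'],
    'Outlet_Size': ['Medium', 'Small', 'High', 'Medium'],
    'Item_MRP': [249.8, 48.3, 141.6, 182.1],
})
categorical_cols = ['Item_Fat_Content', 'Outlet_Size']

# Step 1: encode each category as an integer in 0..n_classes-1.
encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Step 2: expand the integer codes into one 0/1 column per category.
data = pd.get_dummies(data, columns=categorical_cols, dtype=int)

print(data.columns.tolist())
```

The dummy columns matter because the integer codes alone would suggest an ordering between categories that nominal variables do not have.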
test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)
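The source column dropped here suggests that the train and test sets were combined into a single frame for preprocessing and split apart again afterwards. That step is not shown in the report, so the following is only a sketch of how such a round trip might look, assuming toy frames and a source tag with the values 'train' and 'test'.

```python
import pandas as pd

# Toy stand-ins for Train.csv and Test.csv; the values are illustrative.
train = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01'],
                      'Item_Outlet_Sales': [3735.1, 443.4]})
test = pd.DataFrame({'Item_Identifier': ['NCD19']})

# Tag each row with its origin, then stack the frames for shared preprocessing.
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)

# ... cleaning and feature engineering on `data` would go here ...

# Split back by the tag; .copy() gives independent frames to modify in place.
train = data.loc[data['source'] == 'train'].copy()
test = data.loc[data['source'] == 'test'].copy()

# The test rows never had a target, so that all-NaN column is dropped as well.
test.drop(['Item_Outlet_Sales', 'source'], axis=1, inplace=True)
train.drop(['source'], axis=1, inplace=True)
```

Combining the sets this way keeps imputation and encoding consistent across both, at the cost of letting test-set values influence statistics such as the most frequent category.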