Naveen Python - For - Data-Science-Report


INDIRA GANDHI ENGINEERING COLLEGE

SAGAR (M.P.)

REPORT ON INTERNSHIP

“PYTHON FOR DATA SCIENCE”

SESSION 2023-24

Submitted to

Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.)

In partial fulfilment of the degree

Of

Bachelor of Technology
In Information Technology

Guided By:
Mrs. Shalupriya Jain
(Guest Faculty)
Department of Information Technology,
I.G.E.C. Sagar (M.P.)

Submitted To:
Prof. R.S.S. Rawat
(Head of Department)
Department of Information Technology,
I.G.E.C. Sagar (M.P.)

Submitted By:-
Naveen Shivnath Verma: 0601IT22D03

INDIRA GANDHI ENGINEERING COLLEGE

SAGAR (M.P.)

DECLARATION

I hereby declare that the following internship file on “Python For Data Science” is an
authentic work done by me. I undertook the internship as a part of the course curriculum
of Bachelor of Technology in Information Technology at Indira Gandhi Engineering
College, Sagar (M.P.), affiliated to Rajiv Gandhi Proudyogiki Vishwavidyalaya,
Bhopal (M.P.).

Naveen Shivnath Verma : 0601IT223D03

ACKNOWLEDGEMENTS


It is with great reverence that I express my gratitude to my guide "Ms. Shalupriya
Jain", Department of Information Technology Engineering, Indira Gandhi Engineering
College, for her precious guidance and help in this internship work. The credit for the
successful completion of this internship goes to her keen interest, timely guidance
and valuable suggestions, without which my endeavor would have been futile.

I extend my heartfelt gratitude to "Mr. Aswin Mishra", Field Supervisor at Infosys
Springboard, for his unwavering support and dedication in guiding me through the
internship process.

I owe my regards to "Prof. R.S.S. Rawat", Head of Department of Information
Technology Engineering, for his persistent encouragement and the blessings which were
bestowed upon me.

I want to express my sincere thanks to "Dr. Hemant Pathak", Head of Training and
Placement Cell, for his invaluable assistance in securing this opportunity and for
his commitment to our career development.

I owe my sincere thanks to the honorable Principal "Dr. Anurag Trivedi" for the kind
support he rendered us towards the success of our internship.

Naveen Shivnath Verma : 0601IT223D03

INDEX
Sr. No.  Topic
1   Title
2   Index
3   Training Certificate
4   Declaration
5   Acknowledgement
6   About Training
7   About Infosys Springboard
8   Objectives
9   Data Science
10  My Learnings
11  Final Project
12  Reason for choosing Data Science
13  Learning Outcome
14  Scope in Data Science
15  Results


1. ABOUT TRAINING
NAME OF TRAINING: PYTHON FOR DATA SCIENCE
HOSTING INSTITUTION: INFOSYS SPRINGBOARD
DATES: From 29th May 2023 to 29th August 2023

2. ABOUT INFOSYS SPRINGBOARD


Infosys Springboard is an online learning platform run by Infosys, the Indian IT services company
headquartered in Bengaluru. It offers courses and training programs to students.

3. OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to reach conclusions
that optimize business processes and support decision making.
Examples include predictive maintenance of machines and, in the fields of marketing and sales,
sales forecasting based on weather.

4. DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer
science to study and evaluate data. The key objective of Data Science is to extract valuable
information for use in strategic decision making, product development, trend analysis, and
forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and
data retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data
science practitioners apply machine learning algorithms to numbers, text, images, video,
audio, and more to produce artificial intelligence (AI) systems to perform tasks that
ordinarily require human intelligence. In turn, these systems generate insights which analysts
and business users can translate into tangible business value.

DATA SCIENCE PROCESS:


1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the
data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from
different data sources, and transform it. If you have successfully completed this step,
you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations based
on visual and descriptive techniques. The insights you gain from this phase will enable
you to start modeling.
5. Next comes the most exciting part: model building (often referred to as “data modeling”).
It is now that you attempt to gain the insights or make the
predictions stated in your project charter. Now is the time to bring out the heavy guns,
but remember research has taught us that often (but not always) a combination of
simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the
analysis, if needed. One goal of a project is to change a process and/or make better
decisions. You may still need to convince the business that your findings will indeed
change the business process as expected. This is where you can shine in your influencer
role. The importance of this step is more apparent in projects on a strategic and tactical
level. Certain projects require you to perform the business process over and over again,
so automating the project will save time.
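
As a minimal sketch of steps 3 and 4 above (data preparation and data exploration), the following Python snippet combines two small record sets, drops bad rows and summarizes what is left. The records, the field names and the valid range of 0 to 100 are assumptions made up for illustration; they are not taken from the internship data.

```python
from statistics import mean

# Hypothetical raw records from two made-up sources; None marks a missing value.
source_a = [{"id": 1, "marks": 72}, {"id": 2, "marks": None}]
source_b = [{"id": 3, "marks": 9700}, {"id": 4, "marks": 85}]


def prepare(records, low=0, high=100):
    """Step 3: keep only rows whose marks are present and inside the assumed valid range."""
    return [
        r for r in records
        if r["marks"] is not None and low <= r["marks"] <= high
    ]


def explore(records):
    """Step 4: describe the cleaned data before any modeling."""
    values = [r["marks"] for r in records]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Combine both sources, clean them, then summarize.
clean = prepare(source_a + source_b)
summary = explore(clean)
print(summary)
```

In a real project the cleaning rules would come from the data owner and the research goal rather than from a fixed range.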


MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
• Applications of Data Science
  • Anomaly detection (fraud, disease, etc.)
  • Automation and decision-making (creditworthiness, etc.)
  • Classification (classifying emails as “important” or “junk”)
  • Forecasting (sales, revenue, etc.)
  • Pattern detection (weather patterns, financial market patterns, etc.)
  • Recognition (facial, voice, text, etc.)
  • Recommendations (based on learned preferences, recommendation engines can refer you to
    movies, restaurants and books you may like)


2) PYTHON FOR DATA SCIENCE


Introduction to Python, Understanding Operators, Variables and Data Types, Conditional Statements,
Looping Constructs, Functions, Data Structure, Lists, Dictionaries, Understanding Standard Libraries in
Python, reading a CSV File in Python, Data Frames and basic operations with Data Frames, Indexing Data
Frame.

3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE


Introduction to Statistics, Measures of Central Tendency, Understanding the spread of data, Data Distribution,
Introduction to Probability, Probabilities of Discrete and Continuous Variables, Normal Distribution,
Introduction to Inferential Statistics, Understanding the Confidence Interval and margin of error, Hypothesis
Testing, Various Tests, Correlation.

4) PREDICTIVE MODELING AND BASICS OF MACHINE LEARNING


Introduction to Predictive Modeling, Types and Stages of Predictive Models, Hypothesis Generation, Data
Extraction and Exploration, Variable Identification, Univariate Analysis for Continuous Variables and
Categorical Variables, Bivariate Analysis, Treating Missing Values and Outliers, Transforming the Variables,
Basics of Model Building, Linear and Logistic Regression, Decision Trees, K-means Algorithms in Python.
Summary of Procedure of Analyzing Data:
Data science generally has a five-stage life cycle that consists of:
• Capture: Data entry, signal reception, data extraction
• Maintain: Data cleansing, data staging, data processing
• Process: Data mining, clustering/classification, data modelling
• Analyse: Predictive analysis, regression
• Communicate: Data reporting, data visualization


Introduction to Data Science


Data Science

The field of bringing insights from data using scientific techniques is called data science.

Applications

Amazon Go – No checkout lines

Computer Vision - The advancement in recognizing an image by a computer involves processing large
sets of image data from multiple objects of same category. For example, Face recognition.

Spectrum of Business Analysis

(Figure: the spectrum plots each stage by complexity against the value added to the organization. From lowest to highest:)

• Reporting: What happened?
• Detective Analysis: Why did it happen?
• Dashboards: What’s happening now?
• Predictive Analysis: What is likely to happen?
• Big Data: What can happen, given the data that is collected and used?


Reporting / Management Information System

To track what is happening in the organization.

Detective Analysis

Asking questions based on the data we are seeing, such as why something happened.

Dashboard / Business Intelligence

The utopia of reporting: every action in the business is reflected on screen.

Predictive Modelling

Using past data to predict what will happen at a granular level.

Big Data

The stage where the complexity of handling data goes beyond what traditional systems can manage.

This can be caused by the volume, variety or velocity of the data. Specific tools are needed to analyse data at such scale.

Applications of Data Science

• Recommendation System
  Example: On Amazon, recommendations differ from user to user according to their past searches.
• Social Media
  1. Recommendation Engine
  2. Ad placement
  3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products on e-commerce platforms
  1. Recommendation System
  2. Past Data Searched
  3. Discount Price Optimization
• How Google and other search engines know which results are most relevant to our search query
  1. Apply ML and Data Science
  2. Fraud Detection
  3. Ad placement
  4. Personalized search results


Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.

Python for Data science:

Why Python?

1. Python is an open source language.
2. Syntax that is simple and reads close to English.
3. Very large and collaborative developer community.
4. Extensive packages.

• UNDERSTANDING OPERATORS:
Operators are symbols that represent mathematical and logical operations, such as +, -, *, / and ==.
• VARIABLES AND DATATYPES:
Variables are names bound to objects. Basic data types in Python are int (integer), float, bool (Boolean) and
str (string).
• CONDITIONAL STATEMENTS:
If-else statements (single condition)
If-elif-else statements (multiple conditions)
• LOOPING CONSTRUCTS:
For loop
While loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be called any number of times.
• DATA STRUCTURES:

Two types of data structures are covered here:

LISTS: A list is an ordered data structure with elements separated by commas and enclosed within
square brackets.

DICTIONARY: A dictionary is a data structure whose elements are stored as key: value pairs, separated by
commas and enclosed within curly braces {}. Since Python 3.7, dictionaries preserve insertion order.
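
The constructs listed above can be seen together in one short sketch. The student names, marks and grade boundaries below are made up for illustration and are not part of the training material.

```python
def grade(marks):
    """User-defined function that uses an if-elif-else statement."""
    if marks >= 75:
        return "distinction"
    elif marks >= 40:
        return "pass"
    else:
        return "fail"


# Dictionary: key: value pairs inside curly braces (hypothetical data).
marks_by_student = {"Asha": 82, "Ravi": 35, "Meena": 58}

# List: ordered elements inside square brackets, filled by a for loop.
results = []
for name, marks in marks_by_student.items():
    results.append((name, grade(marks)))

# Built-in functions: sum() and len() need no definition.
average = sum(marks_by_student.values()) / len(marks_by_student)

print(results)
print(average)
```

The same grade function is called once per student, which is what makes a function a reusable piece of code.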


Statistics

Descriptive Statistics
Mode
It is the value which occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")  # reads data from the csv file
print(data.head())  # prints the first five rows
mode_data = data['Subject'].mode()  # mode of the Subject column
print(mode_data)
Mean
It is the sum of all values divided by the number of values.
import pandas as pd
data = pd.read_csv("mean.csv")  # reads data from the csv file
print(data.head())  # prints the first five rows
mean_data = data['Overallmarks'].mean()  # mean of the Overallmarks column
print(mean_data)
Median
It is the middle value of the sorted data set.
import pandas as pd
data = pd.read_csv("data.csv")  # reads data from the csv file
print(data.head())  # prints the first five rows
median_data = data['Overallmarks'].median()  # median of the Overallmarks column
print(median_data)
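
To see how the three measures behave without needing a CSV file, here is a small sketch that uses Python's built-in statistics module on made-up marks. The second list repeats the first with one mistyped entry (9700 instead of 97), an assumption chosen to show that the mean shifts strongly while the median and mode stay the same.

```python
from statistics import mean, median, mode

# Made-up marks for illustration only.
marks = [55, 60, 60, 72, 97]
marks_with_typo = [55, 60, 60, 72, 9700]  # last value mistyped

clean_stats = (mean(marks), median(marks), mode(marks))
typo_stats = (mean(marks_with_typo), median(marks_with_typo), mode(marks_with_typo))

print("clean:", clean_stats)
print("typo: ", typo_stats)
```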
Types of variables
• Continuous – takes continuous numeric values. Eg- marks
• Categorical – takes discrete values. Eg- gender
  • Ordinal – ordered categorical variables. Eg- teacher feedback
  • Nominal – unordered categorical variables. Eg- gender


Outliers
Any value that falls far outside the typical range of the data is termed an outlier. Eg- 9700 instead of 97.
Reasons for Outliers
• Typos – made during data collection. Eg- adding an extra zero by mistake.
• Measurement Error – outliers caused by a faulty measuring instrument or operator.
• Intentional Error – errors which are introduced intentionally. Eg- claiming a smaller amount of alcohol
consumed than actual.
• Legit Outlier – values which are not actually errors but are present in the data for legitimate
reasons. Eg- a CEO’s salary might actually be high compared to other employees.

Interquartile Range (IQR)


It is the difference between the third quartile (Q3) and the first quartile (Q1), that is, IQR = Q3 - Q1. It is robust to outliers.
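
A minimal sketch of how the IQR can be used to flag outliers is given below. The marks are made up, and the 1.5 x IQR fences are a common convention that is assumed here rather than taken from the training material. Quartiles are computed with statistics.quantiles using its "inclusive" method; other quartile definitions give slightly different values.

```python
from statistics import quantiles

# Made-up marks; 9700 is a mistyped entry.
marks = [52, 58, 61, 65, 70, 74, 79, 83, 97, 9700]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, _, q3 = quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [m for m in marks if m < lower_fence or m > upper_fence]

print(q1, q3, iqr, outliers)
```

A flagged value still has to be checked by hand, since it may be a legitimate outlier rather than an error.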
Histograms
Histograms depict the underlying frequency of a set of discrete or continuous data that are measured on an
interval scale.
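import pandas as pd
import matplotlib.pyplot as plt
# In a Jupyter notebook, run the magic "%matplotlib inline" first to show plots inline
histogram = pd.read_csv("histogram.csv")  # reads data from the csv file
plt.hist(histogram['Overall Marks'])  # histogram of the Overall Marks column
plt.show()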
Inferential Statistics
Inferential statistics allows us to make inferences about the population from the sample data.
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then
examining what the data tells us about how to proceed. The hypothesis to be tested is called the null
hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is
given the symbol Ha.
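
As one possible illustration of testing H0 against Ha, the sketch below runs a two-sided one-sample z-test with Python's standard library. The sample marks, the hypothesized mean of 70, the assumed known population standard deviation of 6 and the 0.05 significance level are all made-up assumptions; the helper name z_test is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test, assuming the population std. dev. sigma is known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a value at least this extreme under H0, in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# H0: population mean = 70, Ha: population mean != 70 (made-up data).
sample_marks = [72, 75, 78, 74, 71, 77, 76, 73, 79]
z, p_value = z_test(sample_marks, mu0=70, sigma=6)

ALPHA = 0.05  # assumed significance level
reject_h0 = p_value < ALPHA
print(z, p_value, reject_h0)
```

When the population standard deviation is not known, a t-test is the usual choice instead.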


Predictive Modelling

Predictive modelling makes use of past data and its attributes to predict the future.
Eg-
Past: horror movies already watched
Future: unwatched horror movies

Predicting stock price movement


1. Analysing past stock prices.
2. Analysing similar stocks.
3. Future stock price required.
Types
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to
make predictions. The training dataset includes input data and response values.
• Regression – the response takes continuous values. Eg- marks
• Classification – the response takes discrete class values. Eg- cancer prediction is either 0 or 1.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled.
Here the task of the machine is to group unsorted information according to similarities, patterns and
differences without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
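
To make the clustering idea concrete, here is a minimal sketch of k-means on one-dimensional data written in plain Python. The monthly spend figures and the two starting centres are made-up assumptions, and the helper name kmeans_1d is hypothetical; a real project would normally use a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group values around the nearest centre, then move each centre to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up monthly spend of six customers; no labels are given.
spend = [10, 12, 11, 95, 100, 98]
centers, clusters = kmeans_1d(spend, centers=[10, 100])
print(centers, clusters)
```

No response values are supplied here, which is what separates this from the supervised case.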

Stages of Predictive Modelling


1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Deployment/Implementation

Problem Definition
Identify the right problem statement, ideally formulate the problem mathematically.


Model Building

It is a process to create a mathematical model for estimating / predicting the future based on past data.
Eg-
A retail bank wants to know the default behaviour of its credit card customers. It wants to predict the
probability of default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume, as a starting point, that every customer has a 10% default rate.
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes from past information.
A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the past years has a low chance of default (closer to 0).
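
One way to picture how a model moves the 10% starting probability towards 0 or 1 is a logistic function, sketched below. The two weights are hypothetical numbers picked for illustration, not fitted values, and the function name default_probability is an assumption; a real model would learn the weights from past data.

```python
from math import exp, log

BASE_RATE = 0.10  # starting default rate from the example above


def default_probability(volatile_income, healthy_history,
                        w_volatile=1.5, w_healthy=-1.5):
    """Shift the base rate on the log-odds scale; weights are hypothetical, not fitted."""
    log_odds = log(BASE_RATE / (1 - BASE_RATE))
    log_odds += w_volatile * volatile_income + w_healthy * healthy_history
    # The logistic function maps any log-odds value back into (0, 1).
    return 1 / (1 + exp(-log_odds))


p_base = default_probability(0, 0)      # no extra information
p_volatile = default_probability(1, 0)  # volatile income
p_healthy = default_probability(0, 1)   # healthy credit history
print(p_base, p_volatile, p_healthy)
```

With no attributes set the function returns the base rate; a positive weight pushes the probability up and a negative weight pushes it down.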

Steps in Model Building


1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Example-

Have dependent variable?
  Yes: Supervised Learning
  No: Unsupervised Learning

Is dependent variable continuous? (supervised learning only)
  Yes: Regression
  No: Classification

Eg- Predicting whether a customer will buy a product or not is a classification problem.
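
The two questions of the selection chart can be written as a small function. This is only a sketch of the decision logic described above; the function name and the returned labels are assumptions chosen for illustration.

```python
def select_learning_task(has_dependent_variable, is_continuous=False):
    """Follow the two questions of the algorithm selection chart."""
    if not has_dependent_variable:
        return "unsupervised learning"
    if is_continuous:
        return "supervised learning: regression"
    return "supervised learning: classification"


# Will the customer buy the product or not? Dependent variable exists, not continuous.
buy_or_not = select_learning_task(has_dependent_variable=True, is_continuous=False)
# Predicting marks: dependent variable exists and is continuous.
marks_task = select_learning_task(has_dependent_variable=True, is_continuous=True)
# Grouping customers with no target column.
grouping_task = select_learning_task(has_dependent_variable=False)
print(buy_or_not, marks_task, grouping_task, sep="\n")
```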


Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between the independent and dependent variables.
We use the known dependent variable of the train data set to fit the model.
Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model rules.
We apply what was learned in training to the test data set for prediction/estimation.
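
The train and score steps can be sketched with a simple straight-line model fitted by least squares. The study hours and marks below are made-up numbers, and the helper names fit_line and predict are hypothetical; the point is only that the model is fitted on the train set and then applied to the test set, whose dependent variable is unknown.

```python
def fit_line(xs, ys):
    """Training: learn slope and intercept from data with a known dependent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(xs, slope, intercept):
    """Scoring: apply the learned rule to data whose dependent variable is unknown."""
    return [slope * x + intercept for x in xs]


# Train set: past data, marks are known (made-up values).
train_hours = [1, 2, 3, 4, 5]
train_marks = [52, 54, 56, 58, 60]

# Test set: future data, marks are not known yet.
test_hours = [6, 7]

slope, intercept = fit_line(train_hours, train_marks)
predicted_marks = predict(test_hours, slope, intercept)
print(slope, intercept, predicted_marks)
```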


6. FINAL PROJECT
SOURCE CODE:

OUTPUT:


Reason for choosing data science

Data Science has become a revolutionary technology that everyone seems to talk about. Hailed as the
‘sexiest job of the 21st century’, Data Science is a buzzword, with very few people knowing about the
technology in its true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science
to get a real picture. The main points are listed below.

Advantages: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile


Disadvantages: -
1. Mastering Data Science is nearly impossible
2. A large Amount of Domain Knowledge Required
3. Arbitrary Data May Yield Unexpected Results
4. The problem of Data Privacy


Learning Outcome

After completing the training, I am able to:

• Develop relevant programming abilities.
• Demonstrate proficiency with statistical analysis of data.
• Develop the skill to build and assess data-based models.
• Execute statistical analysis with professional statistical software.
• Demonstrate skill in data management.
• Apply data science concepts and methods to solve problems in real-world contexts and
communicate these solutions effectively.


7. SCOPE IN DATA SCIENCE FIELD


A few factors that point to data science’s future, and show why it is crucial to
today’s business needs, are listed below:

• Companies’ inability to handle data


Data is being regularly collected by businesses and companies for transactions and through website
interactions. Many companies face a common challenge – to analyze and categorize the data that is collected
and stored. A data scientist becomes the savior in a situation of mayhem like this. Companies can progress a
lot with proper and efficient handling of data, which results in productivity.

• Revised Data Privacy Regulations


The General Data Protection Regulation (GDPR) came into force in the countries of the European Union in
May 2018. A similar data protection law, the California Consumer Privacy Act (CCPA), took effect in
California in 2020. Such regulations create co-dependency between companies and data scientists, because
data must be stored adequately and responsibly. In
today’s times, people are generally more cautious and alert about sharing data with businesses and giving up a
certain amount of control to them, as there is rising awareness about data breaches and their harmful
consequences. Companies can no longer afford to be careless and irresponsible about their data. The GDPR
is meant to ensure some amount of data privacy in the future.

• Data Science is constantly evolving


Career areas that do not carry any growth potential in them run the risk of stagnating. This indicates that the
respective fields need to constantly evolve and undergo a change for opportunities to arise and flourish in the
industry. Data science is a broad career path that is undergoing developments and thus promises abundant
opportunities in the future. Data science job roles are likely to get more specific, which in turn will lead to
specializations in the field. People inclined towards this stream can exploit their opportunities and pursue
what suits them best through these specifications and specializations.

• An astonishing incline in data growth


Data is generated by everyone on a daily basis with and without our notice. The interaction we have with data
daily will only keep increasing as time passes. In addition, the amount of data existing in the world will
increase at lightning speed. As data production will be on the rise, the demand for data scientists will be
crucial to help enterprises use and manage it well.

• Virtual Reality will be friendlier


In today’s world, we are witnessing how Artificial Intelligence is spreading across the
globe and how companies increasingly rely on it. Big data prospects with its current innovations will flourish more with
advanced concepts like Deep Learning and neural networking. Currently, machine learning is being
introduced and implemented in almost every application. Virtual Reality (VR) and Augmented Reality (AR)
are undergoing monumental modifications too.
In addition, human and machine interaction, as well as dependency, is likely to improve and increase
drastically.

• Blockchain updating with Data science


Blockchain is the main technology underlying cryptocurrencies like Bitcoin. Data
security will live true to its function in this aspect, as detailed transactions will be secured and recorded.
If big data flourishes, then IoT will witness growth too and gain popularity. Edge computing will be
responsible for dealing with data issues and addressing them.


8. RESULTS
During this 6-week training, I successfully learnt about DATA SCIENCE. I am now also able to perform
data analysis using Python. I attempted the various quizzes and assignments provided for periodic evaluation
during the 6 weeks and completed the training with a 100% score in the Final Test.

***
