Naveen Python - For - Data-Science-Report
SAGAR (M.P.)
REPORT ON INTERNSHIP
SESSION 2023-24
Submitted to
Of
Bachelor of Technology
In Information Technology
Submitted By:-
Naveen Shivnath Verma: 0601IT22D03
DECLARATION
I hereby declare that the following Internship file on “Python For Data Science” is an
authentic work done by me. I undertook the internship as a part of the course curriculum
of Bachelor of Technology in Information Technology of Indira Gandhi Engineering
College, Sagar (M.P.), affiliated to Rajiv Gandhi Proudyogiki Vishwavidhyalaya,
Bhopal (M.P.).
ACKNOWLEDGEMENTS
I extend my heartfelt gratitude to Mr. Aswin Mishra, Field Supervisor at Infosys
Springboard, for his unwavering support and dedication in guiding me through the
internship process.
I want to express my sincere thanks to Dr. Hemant Pathak, Head of the Training and
Placement Cell, for his invaluable assistance in securing this opportunity and for
his commitment to our career development.
I owe my sincere thanks to the honorable Principal, Dr. Anurag Trivedi, for the kind
support he rendered towards the success of our internship.
INDEX
Sr. No. Topic
1 Title
2 Index
3 Training Certificate
4 Declaration
5 Acknowledgement
6 About Training
7 About Infosys Springboard
8 Objectives
9 Data Science
10 My Learnings
11 Final Project
12 Reason for choosing Data Science
13 Learning Outcome
14 Scope in Data Science
15 Results
1. ABOUT TRAINING
NAME OF TRAINING: PYTHON FOR DATA SCIENCE
HOSTING INSTITUTION: INFOSYS SPRINGBOARD
DATES: From 29th May 2023 to 29th August 2023
3. OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to draw conclusions
that optimize business processes and support decision making.
Examples include predictive maintenance of machines and, in the fields of marketing and sales,
sales forecasting based on weather.
4. DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer
science to study and evaluate data. The key objective of Data Science is to extract valuable
information for use in strategic decision making, product development, trend analysis, and
forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and
data retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data
science practitioners apply machine learning algorithms to numbers, text, images, video,
audio, and more to produce artificial intelligence (AI) systems to perform tasks that
ordinarily require human intelligence. In turn, these systems generate insights which analysts
and business users can translate into tangible business value.
MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
• Applications of Data Science
Anomaly detection (fraud, disease, etc.)
The field of bringing insights from data using scientific techniques is called data science.
Applications
Computer Vision - For a computer to recognize an image, it must process large sets of image data
from multiple objects of the same category. For example, face recognition.
[Figure: spectrum of analytics, in increasing order of complexity]
Reporting - What happened?
Detective Analysis - Why did it happen?
Dashboards - What's happening now?
Predictive Analysis - What is likely to happen?
Big Data
Detective Analysis
Asking questions based on the data we are seeing, such as why something happened.
Predictive Modelling
Big Data
The stage where the complexity of handling data goes beyond what traditional systems can manage.
This can be caused by the volume, variety or velocity of the data. Specific tools are needed to analyse data at such scale.
Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
Deciding the right credit limit for credit card customers.
Suggesting the right products from e-commerce companies
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
How do Google and other search engines know which results are most relevant to our search query?
1. Apply ML and Data Science
2. Fraud Detection
3. Ad placement
4. Personalized search results
Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.
Why Python?
LISTS: A list is an ordered data structure with elements separated by commas and enclosed within
square brackets [].
DICTIONARY: A dictionary is a data structure with elements separated by commas and stored as
key: value pairs, enclosed within curly braces {}. Elements are looked up by key rather than by
position (since Python 3.7, a dictionary also preserves insertion order).
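The two structures described above can be sketched as follows. This is a minimal illustration; the sample values and the `student` record are hypothetical and not taken from the training material.

```python
# A list keeps its elements in order and allows access by position.
marks = [72, 85, 91]            # hypothetical sample values
marks.append(64)                # add an element at the end
first_mark = marks[0]           # indexing starts at 0

# A dictionary maps keys to values and allows lookup by key.
student = {"name": "Asha", "marks": 85}   # hypothetical record
student["grade"] = "A"          # add a new key: value pair
keys_in_order = list(student)   # keys come back in insertion order (Python 3.7+)

print(marks, first_mark)
print(student, keys_in_order)
```

Lists suit data that is naturally a sequence, while dictionaries suit records whose fields are accessed by name.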
Statistics
Descriptive Statistics
Mode
It is the value which occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mode_data = data['Subject'].mode()  # mode of the Subject column
print(mode_data)
Mean
import pandas as pd
data = pd.read_csv("mean.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
mean_data = data['Overallmarks'].mean()  # mean of the Overallmarks column
print(mean_data)
Median
It is the middle value of the data set when the values are sorted.
import pandas as pd
data = pd.read_csv("data.csv")  # read data from the CSV file
print(data.head())  # print the first five rows
median_data = data['Overallmarks'].median()  # median of the Overallmarks column
print(median_data)
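The three snippets above read CSV files that are not included in this report. The following is a minimal self-contained sketch, assuming a small hypothetical table that reuses the column names `Subject` and `Overallmarks`; the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical data standing in for the CSV files used above.
data = pd.DataFrame({
    "Subject": ["Maths", "Science", "Maths", "English", "Maths"],
    "Overallmarks": [60, 70, 70, 80, 95],
})

# mode() returns a Series because several values can tie; take the first one.
mode_subject = data["Subject"].mode().iloc[0]
mean_marks = data["Overallmarks"].mean()      # sum of values / number of values
median_marks = data["Overallmarks"].median()  # middle value after sorting

print(mode_subject, mean_marks, median_marks)
```

Here the mean (75.0) is pulled above the median (70.0) by the single high value 95, which illustrates why the median is often preferred for skewed data.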
Types of variables
Continuous – Takes continuous numeric values. Eg- marks
Categorical – Takes discrete values. Eg- gender
Ordinal – Ordered categorical variable. Eg- teacher feedback
Nominal – Unordered categorical variable. Eg- gender
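One possible way to represent these variable types in pandas is sketched below. The category labels ("Poor", "Average", "Good") and all sample values are assumptions made for illustration.

```python
import pandas as pd

# Continuous: ordinary numeric values.
marks = pd.Series([67.5, 82.0, 91.25])

# Nominal: categories with no inherent order.
gender = pd.Series(["M", "F", "F"], dtype="category")

# Ordinal: categories with a declared order (hypothetical feedback scale).
feedback_scale = pd.CategoricalDtype(
    categories=["Poor", "Average", "Good"], ordered=True
)
feedback = pd.Series(["Good", "Poor", "Average"], dtype=feedback_scale)

# Comparisons such as max() are only meaningful for ordered categories.
best_feedback = feedback.max()
print(marks.dtype, gender.cat.ordered, feedback.cat.ordered, best_feedback)
```

Declaring the order explicitly lets sorting and comparisons follow the intended scale rather than alphabetical order.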
Outliers
Any value which falls far outside the usual range of the data is termed an outlier. Eg- 9700 instead of 97.
Reasons for Outliers
Typos - Errors made during data collection. Eg- adding an extra zero by mistake.
Measurement Error - Outliers in the data caused by faulty measurement.
Intentional Error - Errors which are introduced intentionally. Eg- claiming a smaller amount of alcohol
consumed than the actual amount.
Legit Outlier - Values which are not actually errors but are in the data for legitimate
reasons. Eg- a CEO’s salary might actually be high compared to other employees.
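The report does not name a method for finding outliers. As a sketch, the widely used 1.5 x IQR rule (an assumption swapped in here, not necessarily the method taught in the training) can flag a value such as 9700 entered instead of 97. The marks below are hypothetical.

```python
import pandas as pd

# Hypothetical marks with one typo-style entry (9700).
marks = pd.Series([55, 62, 68, 71, 75, 80, 88, 91, 9700])

q1 = marks.quantile(0.25)   # first quartile
q3 = marks.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]

print(outliers.tolist())
```

A flagged value still needs a human decision: a typo should be corrected, while a legitimate outlier such as a CEO's salary may need to be kept.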
Predictive Modelling
Using past data and its attributes, we predict the future.
Eg-
Past: Horror Movies
Future: Unwatched Horror Movies
Problem Definition
Identify the right problem statement and, ideally, formulate the problem mathematically.
Model Building
It is a process to create a mathematical model for estimating / predicting the future based on past data.
Eg-
A retail bank wants to know the default behaviour of its credit card customers. It wants to predict the
probability of default for each customer in the next three months.
Probability of default would lie between 0 and 1.
Assume every customer has a 10% default rate.
Probability of default for each customer in next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes of past information.
A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the past years has low chances of default (closer to 0).
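The idea of starting from a 10% base rate and moving it towards 0 or 1 can be sketched with log-odds and the logistic function. This is only one possible formulation; the function name `default_probability` and the adjustment values are hypothetical and chosen purely for illustration.

```python
import math

def default_probability(base_rate, adjustments):
    """Shift a base default rate using additive log-odds adjustments."""
    log_odds = math.log(base_rate / (1 - base_rate))
    log_odds += sum(adjustments)
    # The logistic function maps any log-odds value back into (0, 1).
    return 1 / (1 + math.exp(-log_odds))

BASE_RATE = 0.10  # starting assumption from the text: 10% for every customer

# Hypothetical adjustments: positive pushes towards 1, negative towards 0.
volatile_income = default_probability(BASE_RATE, [1.5])
healthy_history = default_probability(BASE_RATE, [-1.5])
no_information = default_probability(BASE_RATE, [])

print(round(no_information, 3), round(volatile_income, 3), round(healthy_history, 3))
```

With no attributes the result stays at the base rate, and the output always remains strictly between 0 and 1, as a probability must.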
Algorithm Selection
[Figure: algorithm selection flowchart]
Yes: Supervised Learning. No: Unsupervised Learning.
Is the dependent variable continuous?
Yes: Regression. No: Classification.
Eg- Predicting whether a customer will buy a product or not is a classification problem.
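The decision flow above can be sketched as a small helper. It assumes that the first branch of the flowchart asks whether a dependent variable is available, which is the usual distinction between supervised and unsupervised learning; the function name `suggest_learning_task` is hypothetical.

```python
def suggest_learning_task(has_dependent_variable, is_continuous=False):
    """Return the kind of learning task suggested by the flowchart."""
    if not has_dependent_variable:
        # Assumed first branch: no dependent variable means unsupervised learning.
        return "Unsupervised Learning"
    # Supervised learning: the type of the dependent variable decides the task.
    return "Regression" if is_continuous else "Classification"

# "Will the customer buy the product or not?" has a yes/no dependent variable.
print(suggest_learning_task(has_dependent_variable=True, is_continuous=False))
```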
Algorithms
Logistic Regression
Decision Tree
Random Forest
Training Model
It is the process of learning the relationship / correlation between the independent and dependent variables.
We use the known dependent variable of the train data set to fit the model.
Dataset
Train
Past data (known dependent variable).
Used to train model.
Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating / predicting the dependent variable of the test data set by applying the model rules.
We apply what was learnt in training to the test data set for prediction / estimation.
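The train, test and scoring steps can be sketched with scikit-learn's `LogisticRegression`, one of the algorithms listed above. The data is synthetic and the variable names (`income_volatility`, `defaulted`) are hypothetical; a held-out split stands in for future data, so its true labels are simply not shown to the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: one independent variable and a binary dependent variable.
income_volatility = rng.uniform(0, 1, size=200).reshape(-1, 1)
noise = rng.normal(0, 0.2, size=200)
defaulted = (income_volatility[:, 0] + noise > 0.6).astype(int)

# Train set: dependent variable known. Test set: treated as unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    income_volatility, defaulted, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)                        # training

probabilities = model.predict_proba(X_test)[:, 1]  # scoring: P(default)
predictions = model.predict(X_test)                # scoring: 0 or 1

print(len(predictions), probabilities.min(), probabilities.max())
```

Because the test labels are known in this sketch, they could also be compared with `predictions` to estimate how well the model generalises.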
6. FINAL PROJECT
SOURCE CODE:
OUTPUT:
Data Science has become a revolutionary technology that everyone seems to talk about. Hailed as the
‘sexiest job of the 21st century’, Data Science is a buzzword, yet very few people know about the
technology in its true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science
to get a realistic picture. This section discusses these points and provides the
necessary insights about Data Science.
Advantages: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages: -
1. Mastering Data Science is nearly impossible
2. A Large Amount of Domain Knowledge Required
3. Arbitrary Data May Yield Unexpected Results
4. The problem of Data Privacy
Learning Outcome
application. Virtual Reality (VR) and Augmented Reality (AR) are undergoing monumental modifications too.
In addition, human and machine interaction, as well as dependency, is likely to improve and increase
drastically.
8. RESULTS
During this 6-week training I successfully learnt about Data Science, and I am now able to perform
data analysis using Python. I also attempted the various quizzes and assignments provided for periodic evaluation
during the 6 weeks and completed the training with a 100% score in the Final Test.
***