PTS Report Final
This is to certify that Soumya Awasthi (16EARCS746) has successfully delivered the Practical
Training Seminar to satisfaction, and that this report is submitted in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science &
Engineering under Rajasthan Technical University, Kota.
ACKNOWLEDGEMENT
I would like to sincerely thank my guide, Er. Siddharth Jain, for his invaluable guidance,
constant assistance, support, and constructive suggestions for the betterment of this technical
seminar. I would like to convey my heartfelt thanks to our Head of Department, Er. Akhil
Pandey, for giving me the opportunity to embark upon this topic and for his continued
encouragement throughout the preparation of this presentation. I also want to thank all the
staff members of the Department of Computer Science for helping, directly or indirectly, in
completing this work successfully. Finally, I am thankful to my parents and friends for their
continued moral and material support throughout the course and for helping me finalize this
report.
Soumya Awasthi
LIST OF CONTENTS
1. INTRODUCTION
1.1. What is Data Science
2. TECHNOLOGY USED
3. TECHNICAL DETAILS AND WORKING
4. APPLICATIONS AND CASE STUDY
5. CONCLUSION
ABSTRACT
This report provides the details of different areas of Data Science.
The purpose of this practical training was to gain knowledge in the field of Data Science. We
covered different tools and languages used for Data Science, such as Python, R, Hadoop, and
Spark. We start with programming in Python, covering all the basic concepts of the language,
then proceed to advanced Python topics such as modules and packages, and then move ahead
to R programming. After the basic concepts, we move forward to data processing and some
Machine Learning (ML) algorithms, where we process data and classify it according to a
given scenario. At last, we move on to Big Data and tools such as Spark and Hive.
CHAPTER-I
INTRODUCTION
As the world entered the era of big data, the need for its storage also grew. This was the main
challenge and concern for enterprise industries until 2010, when the main focus was on
building frameworks and solutions to store data. Now that Hadoop and other frameworks
have successfully solved the problem of storage, the focus has shifted to the processing of this
data. Data Science is the secret sauce here. All the ideas you see in Hollywood sci-fi movies
can actually be turned into reality by Data Science. Data Science is the future of Artificial
Intelligence. Therefore, it is very important to understand what Data Science is and how it
can add value to your business.
Traditionally, the data we had was mostly structured and small in size, and could be analyzed
using simple BI tools. Unlike data in traditional systems, which was mostly structured, today
most data is unstructured or semi-structured. Industry analyses of data trends projected that
by 2020, more than 80% of data would be unstructured.
This data is generated from different sources such as financial logs, text files, multimedia,
sensors, and instruments. Simple BI tools are not capable of processing this huge volume and
variety of data. This is why we need more complex and advanced analytical tools and
algorithms for processing and analyzing it, and for drawing meaningful insights out of it.
This is not the only reason why Data Science has become so popular. Let’s dig deeper and
see how Data Science is being used in various domains.
What if you could understand the precise requirements of your customers from existing data
such as their past browsing history, purchase history, age, and income? No doubt you had all
this data earlier too, but now, with the vast amount and variety of data, you can train models
more effectively and recommend products to your customers with more precision. Wouldn't
it be amazing, as it would bring more business to your organization?
Let’s take a different scenario to understand the role of Data Science in decision making.
What if your car had the intelligence to drive you home? Self-driving cars collect live data
from sensors, including radars, cameras, and lasers, to create a map of their surroundings.
Based on this data, the car takes decisions such as when to speed up, when to slow down,
when to overtake, and where to take a turn, making use of advanced machine learning
algorithms.
Let’s see how Data Science can be used in predictive analytics. Take weather forecasting as
an example. Data from ships, aircraft, radars, and satellites can be collected and analyzed to
build models. These models will not only forecast the weather but also help in predicting the
occurrence of natural calamities. This will help you take appropriate measures beforehand
and save many precious lives.
First, let’s see what Data Science is: a blend of various tools, algorithms, and machine
learning principles with the goal of discovering hidden patterns in raw data. How is this
different from what statisticians have been doing for years?
Data Science is primarily used to make decisions and predictions, making use of predictive
causal analytics, prescriptive analytics (predictive plus decision science), and machine
learning.
Predictive causal analytics – If you want a model that can predict the possibilities of a
particular event in the future, you need to apply predictive causal analytics. Say you are
providing money on credit; then the probability of customers making future credit payments
on time is a matter of concern for you. Here, you can build a model that performs predictive
analytics on a customer's payment history to predict whether future payments will be on time
or not.
Prescriptive analytics – If you want a model that has the intelligence to take its own decisions
and the ability to modify them with dynamic parameters, you certainly need prescriptive
analytics. This relatively new field is all about providing advice: it not only predicts but
suggests a range of prescribed actions and associated outcomes. The best example of this is
Google's self-driving car. The data gathered by vehicles can be used to train self-driving cars.
You can run algorithms on this data to bring intelligence to it, enabling the car to take
decisions such as when to turn, which path to take, and when to slow down or speed up.
Machine learning for making predictions – If you have the transactional data of a finance
company and need to build a model to determine future trends, machine learning algorithms
are the best bet. This falls under the paradigm of supervised learning: it is called supervised
because you already have labeled data on which you can train your machines. For example, a
fraud detection model can be trained using a historical record of fraudulent purchases.
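As a minimal sketch of this supervised idea, the toy example below fits a one-feature logistic regression by gradient descent on made-up, labeled transaction data (small amounts legitimate, large amounts fraudulent). The data, feature, and threshold are all hypothetical; a real pipeline would use many features and a library such as scikit-learn.

```python
# Toy supervised learning for fraud detection: fit p(fraud) = sigmoid(w*x + b)
# on labeled historical transactions using stochastic gradient descent.
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights (w, b) by minimising log-loss with per-sample updates."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            grad = p - y              # derivative of the log-loss
            w -= lr * grad * x
            b -= lr * grad
    return w, b

def predict(w, b, x):
    """Classify as fraud when the predicted probability reaches 0.5."""
    return 1 / (1 + math.exp(-(w * x + b))) >= 0.5

# Hypothetical historical record: amount (in lakhs) and fraud label.
amounts = [0.1, 0.2, 0.3, 0.4, 2.0, 2.2, 2.5, 3.0]
labels  = [0,   0,   0,   0,   1,   1,   1,   1]
w, b = train_logistic(amounts, labels)
print(predict(w, b, 0.2))   # small amount: classified legitimate (False)
print(predict(w, b, 2.8))   # large amount: classified fraudulent (True)
```

Because the model is trained on already-labeled history, this is exactly the "supervised" setting the text describes.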
Machine learning for pattern discovery – If you don't have the parameters on which to base
predictions, then you need to find the hidden patterns within the dataset to be able to make
meaningful predictions. This is nothing but an unsupervised model, as you don't have any
predefined labels for grouping. The most common algorithm used for pattern discovery is
clustering.
Let's say you are working at a telephone company and you need to establish a network by
putting up towers in a region. You can then use the clustering technique to find tower
locations that ensure all users receive optimum signal strength.
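The tower-placement idea can be sketched with a tiny k-means implementation: cluster user coordinates and place one tower at each cluster centroid. The user locations below are made up, and a real project would use a library implementation such as scikit-learn's KMeans.

```python
# Toy k-means: assign each point to its nearest centroid, then move each
# centroid to the mean of its group, and repeat until it settles.
def kmeans(points, k, iters=20):
    centroids = points[:k]                      # naive initialisation
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            groups[i].append(p)
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Two obvious groups of users; we expect one "tower" near each group.
users = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
towers = kmeans(users, k=2)
print(sorted(towers))   # ≈ [(1.5, 1.5), (8.5, 8.5)]
```

No labels are given anywhere: the grouping emerges from the data itself, which is what makes this unsupervised.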
CHAPTER-II
TECHNOLOGY USED
Core Python
Advanced Python
Algorithms + Statistics
Data Analysis
R Language
Big Data Hadoop
Data Science
1. Core Python: Introduction to Python, Basic Syntax, Data Types, Variables, Operators,
Input/Output, Flow of Control (Modules, Branching), If, If-else, Nested if-else, Looping,
For, While, Nested Loops, Control Structures, Break, Continue, Pass, Strings and Tuples,
Accessing Lists, Assigning and Retrieving Values from Lists, Introducing Tuples, Accessing
Tuples, Operations, Functions and Functional Programming, Declaring and Calling
Functions, Special Functions in Python (lambda, map, and reduce), Advanced Functions in
Python (variable-length arguments, Closures, and Decorators), Namespaces, Generators, and
Iterators.
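A few of the core features listed above can be shown in a short, self-contained snippet covering lambda, map, reduce, closures, and generators:

```python
# lambda + map: square every number, then reduce: sum the squares.
from functools import reduce

nums = [1, 2, 3, 4, 5]
squares = list(map(lambda x: x * x, nums))
total = reduce(lambda a, b: a + b, squares)
print(squares, total)            # [1, 4, 9, 16, 25] 55

def make_counter():
    """Closure: the inner function remembers state from the enclosing scope."""
    count = 0
    def step():
        nonlocal count
        count += 1
        return count
    return step

counter = make_counter()
print(counter(), counter())      # 1 2

def evens(limit):
    """Generator: yields values lazily instead of building a list."""
    n = 0
    while n < limit:
        yield n
        n += 2

print(list(evens(7)))            # [0, 2, 4, 6]
```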
2. Advanced Python: Object Orientation, OOP Concepts, Classes and Objects, Attributes,
Inheritance, Overloading, Overriding, Data Hiding, Meta Classes, Shared Memory
Concepts, Exception Handling, the except Clause, the try...finally Clause, User-Defined
Exceptions, Debugging Modules (pdb, doctest, and loggers). Python Libraries – NumPy,
SciPy, pandas, scikit-learn, matplotlib, bs4, etc.
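Several of these topics (inheritance, overriding, user-defined exceptions, try/except/finally) fit in one small illustrative example; the class names here are invented for the sketch:

```python
# A tiny class hierarchy with overriding plus a user-defined exception.
class InsufficientDataError(Exception):
    """User-defined exception raised when a model receives no samples."""

class Model:
    def __init__(self, name):
        self.name = name
    def predict(self, data):
        raise NotImplementedError

class MeanModel(Model):              # inheritance
    def predict(self, data):         # overriding the base-class method
        if not data:
            raise InsufficientDataError("no samples given")
        return sum(data) / len(data)

m = MeanModel("baseline")
print(m.predict([2, 4, 6]))          # 4.0

try:
    m.predict([])
except InsufficientDataError as e:   # except clause catches our exception
    print("caught:", e)
finally:                             # finally clause always runs
    print("done")
```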
3. Probability & Statistics: Introduction to Statistics – Descriptive Statistics, Summary
Statistics, Basic Probability Theory, Statistical Concepts (uni-variate and bi-variate
sampling, distributions, re-sampling, statistical inference, prediction error), Probability
Distributions (continuous and discrete – Normal, Bernoulli, Binomial, Negative Binomial,
Geometric, and Poisson), Bayes' Theorem, Central Limit Theorem, Data Exploration &
Preparation, Concepts of Correlation, Regression, Covariance, Outliers, etc.
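Bayes' Theorem from this syllabus can be made concrete with a classic worked example. The numbers below are invented for illustration: a diagnostic test with 99% sensitivity and 95% specificity for a condition affecting 1% of a population.

```python
# Bayes' theorem: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01
p_pos_given_disease = 0.99           # sensitivity
p_pos_given_healthy = 0.05           # 1 - specificity (false positive rate)

# Law of total probability for the denominator
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # 0.167
```

Despite the accurate test, a positive result only implies about a 1-in-6 chance of disease, because the condition is rare: a counter-intuitive outcome that Bayes' Theorem makes precise.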
4. R Programming: Introduction & Installation of R, R Basics, Finding Help, Code Editors
for R, Command Packages, Manipulating and Processing Data in R, Reading and Getting
Data into R, Exporting Data from R, Data Objects – Data Types & Data Structures, Viewing
Named Objects, Structure of Data Items, Manipulating and Processing Data in R
(creating, accessing, and sorting data frames; extracting, combining, merging, and reshaping
data frames), Control Structures, Functions in R (numeric, character, statistical), Working
with Objects, Viewing Objects within Objects, Constructing Data Objects, Building R
Packages, Running and Manipulating Packages, Parametric and Non-parametric Tests –
ANOVA, Chi-Square, t-Test, U-Test, Introduction to Graphical Analysis, Using Plots (box
plots, scatter plots, pie charts, bar charts, line charts), Plotting Variables, Designing Special
Plots, Simple Linear Regression, Multiple Regression.
Projects: Market Basket Analysis, Housing Price Prediction, Student Evaluation.
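The simple linear regression listed above (done in R via lm()) reduces to ordinary least squares, which is short enough to compute by hand. The sketch below does so in plain Python on invented housing-style data (area vs. price), just to show the formula behind the R function:

```python
# Ordinary least squares for one predictor:
#   slope = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²), intercept = ȳ - slope * x̄
def fit_line(xs, ys):
    """Return (slope, intercept) minimising squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: area (hundreds of sq ft) vs price (lakhs).
area  = [5, 7, 8, 10, 12]
price = [25, 34, 38, 48, 57]
slope, intercept = fit_line(area, price)
print(round(slope, 2), round(intercept, 2))
```

In R, `lm(price ~ area)` would produce the same coefficients.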
5. Big Data: Introduction to Big Data – the definition of big data, enterprise/structured data,
social/unstructured data, the need for analytics on unstructured data, What is Big Data, the
Big Deal about Big Data, Big Data sources, industries using Big Data, Big Data challenges.
Hadoop ETL: Hadoop ETL development, the ETL process in Hadoop, discussion of ETL
functions, data extraction, the need for ETL tools, advantages of ETL tools.
Pig and Hive: Programming Pig – an engine for executing data flows in parallel on
Hadoop; Programming with Hive – a data warehouse system for Hadoop;
optimizing with combiners and partitioners (lab); more common algorithms: sorting,
indexing, and searching (lab); relational manipulation: map-side and reduce-side joins
(lab); evolution, purpose, and use; HDFS – overview and concepts, data flow (read and
write), interfaces to HDFS (HTTP, CLI, and Java API), high availability and NameNode
federation; MapReduce – developing and deploying programs, optimization techniques,
MapReduce anatomy, data flow framework programming, MapReduce best practices and
debugging; introduction to the Hadoop ecosystem; integrating R with Hadoop.
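The MapReduce data flow described above can be sketched in a few lines of plain Python: the map phase emits (word, 1) pairs, the shuffle groups pairs by key, and the reduce phase sums each group. This is only a single-process illustration of the model, not a substitute for Hadoop's distributed runtime.

```python
# Word count, the canonical MapReduce example, as three small phases.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a final count."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs big tools", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])   # 3 2
```

A combiner, as mentioned in the lab topics, would simply apply the reduce step on each mapper's local output before the shuffle to cut network traffic.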
CHAPTER-III
TECHNICAL DETAILS AND WORKING
TECHNOLOGIES USED:
ggplot2:
This is the backbone of this project. ggplot2 is the most popular data visualization library in
R and is widely used for creating aesthetic visualization plots.
ggthemes:
This is more of an add-on to our main ggplot2 library. With it we can create extra themes
and scales on top of the mainstream ggplot2 package.
lubridate:
Our dataset involves various time frames. In order to understand our data in separate time
categories, we make use of the lubridate package.
dplyr:
This package is the lingua franca of data manipulation in R.
tidyr:
This package helps you tidy your data. The basic principle of tidyr is that each variable is
present in its own column, each observation is represented by its own row, and each value
occupies its own cell.
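The tidy-data principle described above (one variable per column, one observation per row) is what tidyr's reshaping functions implement in R. The same wide-to-long reshape can be sketched in plain Python on made-up data to show what the transformation does:

```python
# "Wide" data: each year is a column, so one row holds two observations.
wide = [
    {"city": "Jaipur", "2019": 31, "2020": 29},
    {"city": "Kota",   "2019": 33, "2020": 32},
]

# Tidy ("long") form: each (city, year) observation gets its own row,
# and every variable (city, year, temp) gets its own column.
tidy = [
    {"city": row["city"], "year": year, "temp": row[year]}
    for row in wide
    for year in ("2019", "2020")
]
print(len(tidy))   # 4 rows: one per observation
print(tidy[0])     # {'city': 'Jaipur', 'year': '2019', 'temp': 31}
```

In R, `tidyr::pivot_longer()` (or the older `gather()`) performs this reshape on a data frame in one call.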
CHAPTER-IV
APPLICATIONS AND CASE STUDY
CODE :
INPUT SCREENSHOT 1:
INPUT SCREENSHOT 2:
INPUT SCREENSHOT 3:
INPUT SCREENSHOT 4:
INPUT SCREENSHOT 5:
OUTPUT SCREENSHOT 1:
OUTPUT SCREENSHOT 2:
OUTPUTS:
CONCLUSION
At the end of the Uber data analysis R project, we have seen how to create data
visualizations. We made use of packages such as ggplot2, which allowed us to plot various
types of visualizations pertaining to customer trips across several time frames.
REFERENCES
1. Webpage: https://data-flair.training/blogs/r-data-science-project-uber-data-analysis/
2. David Beazley and Brian K. Jones, Python Cookbook, O'Reilly Media.