
CERTIFICATE

This is to certify that Soumya Awasthi (16EARCS746) has successfully delivered the Practical
Training Seminar to satisfaction, and the report is submitted in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science &
Engineering under Rajasthan Technical University, Kota.

SEMINAR COORDINATOR                              Dr. Akhil Pandey
Name                                             Head of Department
Designation                                      Computer Science & Engineering

ACKNOWLEDGEMENT

I would like to sincerely thank my guide, Er. Siddharth Jain, for his invaluable guidance,
constant assistance, support, and constructive suggestions for the betterment of this technical
seminar. I would like to convey my heartfelt thanks to our Head of Department, Er. Akhil
Pandey, for giving me the opportunity to embark upon this topic and for his continued
encouragement throughout the preparation of this presentation. I also want to thank all the
staff members of the Department of Computer Science for helping, directly or indirectly, in
completing this work successfully. Finally, I am thankful to my parents and friends for their
continued moral and material support throughout the course and for helping me finalize this
report.

Soumya Awasthi

4th year 7th Semester (16EARCS746)

Department of Computer Science

Arya College of Engineering & IT

LIST OF CONTENTS

1. INTRODUCTION
   1.1 What is Data Science?
2. TECHNOLOGY USED
3. TECHNICAL DETAILS AND WORKING
4. APPLICATIONS AND CASE STUDY
5. CONCLUSION

ABSTRACT

This report presents the details of different areas of Data Science.

The purpose of this practical training is to gain knowledge in the field of Data Science. We
have covered different tools and languages used for Data Science, such as Python, R, Hadoop,
and Spark. We start with programming in Python, covering all the basic concepts of the
language, then proceed to advanced Python topics such as modules and packages, and move
on to R programming. After the basic concepts, we move forward to data processing and some
Machine Learning (ML) algorithms, where we process data and classify it according to a given
scenario. At last, we move on to Big Data and tools such as Spark and Hive.

CHAPTER-I

INTRODUCTION

As the world entered the era of big data, the need for its storage also grew. This was the main
challenge and concern for enterprise industries until 2010, when the main focus was on
building frameworks and solutions to store data. Now that Hadoop and other frameworks have
successfully solved the problem of storage, the focus has shifted to the processing of this data,
and Data Science is the secret sauce here. All the ideas you see in Hollywood sci-fi movies
can actually be turned into reality by Data Science. Data Science is the future of Artificial
Intelligence. Therefore, it is very important to understand what Data Science is and how it can
add value to your business.

Let’s Understand Why We Need Data Science

Traditionally, the data we had was mostly structured and small in size, and it could be analyzed
using simple BI tools. Unlike data in traditional systems, which was mostly structured, today
most data is unstructured or semi-structured. Data trends indicate that by 2020 more than 80%
of all data will be unstructured.

This data is generated from different sources such as financial logs, text files, multimedia,
sensors, and instruments. Simple BI tools are not capable of processing this huge volume and
variety of data, which is why we need more complex and advanced analytical tools and
algorithms for processing it, analyzing it, and drawing meaningful insights from it.

This is not the only reason why Data Science has become so popular. Let’s dig deeper and
see how Data Science is being used in various domains.

What if you could understand the precise requirements of your customers from existing data
such as their past browsing history, purchase history, age, and income? No doubt you had all
this data earlier too, but now, with the vast amount and variety of data available, you can train
models far more effectively and recommend products to your customers with greater
precision. Wouldn’t it be amazing, given how much more business it would bring to your
organization?

Let’s take a different scenario to understand the role of Data Science in decision making.
What if your car had the intelligence to drive you home? Self-driving cars collect live data
from sensors, including radars, cameras, and lasers, to create a map of their surroundings.
Based on this data, they make decisions such as when to speed up, when to slow down, when
to overtake, and where to take a turn, making use of advanced machine learning algorithms.

Let’s see how Data Science can be used in predictive analytics, taking weather forecasting as
an example. Data from ships, aircraft, radars, and satellites can be collected and analyzed to
build models. These models will not only forecast the weather but also help in predicting the
occurrence of natural calamities, which lets you take appropriate measures beforehand and
save many precious lives.

1.1 What is Data Science?


Use of the term Data Science is increasingly common, but what exactly does it mean? What
skills do you need to become a Data Scientist? What is the difference between BI and Data
Science? How are decisions and predictions made in Data Science? These are some of the
questions that will be answered further on.

First, let’s see what Data Science is. Data Science is a blend of various tools, algorithms, and
machine learning principles with the goal of discovering hidden patterns in raw data. How is
this different from what statisticians have been doing for years? The difference lies in the
scope: Data Science is primarily used to make decisions and predictions, making use of
predictive causal analytics, prescriptive analytics (predictive plus decision science), and
machine learning.

Predictive causal analytics – If you want a model that can predict the possibility of a
particular event in the future, you need to apply predictive causal analytics. Say you are
providing money on credit; then the probability of customers making future credit payments
on time is a matter of concern for you. Here, you can build a model that performs predictive
analytics on the payment history of the customer to predict whether future payments will be
on time or not.
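
To make this concrete, here is a minimal sketch in R of such a model, using logistic
regression via glm(). The data frame and its column names (late_payments, credit_util,
paid_on_time) are invented purely for illustration; a real payment-history dataset would be
far larger.

    # Hypothetical payment-history data; all values are invented for illustration
    history <- data.frame(
      late_payments = c(0, 3, 1, 5, 0, 2, 4, 0),                 # past late payments
      credit_util   = c(0.2, 0.9, 0.4, 0.8, 0.1, 0.6, 0.9, 0.2), # credit utilisation
      paid_on_time  = c(1, 0, 1, 0, 1, 0, 1, 0)                  # 1 = paid on time
    )

    # Logistic regression: estimate the probability of an on-time payment
    model <- glm(paid_on_time ~ late_payments + credit_util,
                 data = history, family = binomial)

    # Predicted probability for a new customer with 1 late payment, 30% utilisation
    predict(model, newdata = data.frame(late_payments = 1, credit_util = 0.3),
            type = "response")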

Prescriptive analytics – If you want a model that has the intelligence to take its own
decisions and the ability to modify them with dynamic parameters, you certainly need
prescriptive analytics. This relatively new field is all about providing advice: it not only
predicts but suggests a range of prescribed actions and associated outcomes. The best example
of this is Google’s self-driving car. The data gathered by vehicles can be used to train self-
driving cars. You can run algorithms on this data to bring intelligence to it, enabling the car
to take decisions such as when to turn, which path to take, and when to slow down or speed up.

Machine learning for making predictions – If you have the transactional data of a finance
company and need to build a model to determine future trends, machine learning algorithms
are the best bet. This falls under the paradigm of supervised learning: it is called supervised
because you already have the data on which you can train your machines. For example, a
fraud detection model can be trained using a historical record of fraudulent purchases.
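
As a small sketch of this supervised idea, the snippet below trains a decision tree with R's
rpart package on an invented transaction table; the column names and values are assumptions
for illustration only.

    # Invented historical transactions; 'fraud' is the known (supervised) label
    library(rpart)
    tx <- data.frame(
      amount   = c(20, 5000, 35, 7000, 15, 4200, 60, 5500),
      overseas = c(0, 1, 0, 1, 0, 1, 1, 0),
      fraud    = factor(c("no", "yes", "no", "yes", "no", "yes", "no", "yes"))
    )

    # Train a classification tree on the labelled history
    tree <- rpart(fraud ~ amount + overseas, data = tx, method = "class",
                  control = rpart.control(minsplit = 2))

    # Score a new, unseen transaction
    predict(tree, data.frame(amount = 6000, overseas = 1), type = "class")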

Machine learning for pattern discovery – If you don’t have the parameters on which to base
predictions, then you need to find the hidden patterns within the dataset to be able to make
meaningful predictions. This is nothing but unsupervised learning, as you don’t have any
predefined labels for grouping. The most common algorithm used for pattern discovery is
clustering.
Let’s say you work for a telephone company and need to establish a network by putting up
towers in a region. You can then use the clustering technique to find the tower locations that
ensure all users receive optimum signal strength.
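
Here is a minimal sketch of this idea in R, using the built-in kmeans() function on randomly
generated user coordinates; the simulated data and the choice of five towers are assumptions
made only for illustration.

    # Simulated user locations in a 100 x 100 km region (invented data)
    set.seed(42)
    users <- data.frame(x = runif(500, 0, 100), y = runif(500, 0, 100))

    # Group users into 5 clusters; each cluster centre is a candidate tower site
    fit <- kmeans(users, centers = 5)
    fit$centers   # coordinates of the proposed tower locations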

CHAPTER-II
TECHNOLOGY USED

• Core Python
• Advanced Python
• Algorithms + Statistics
• Data Analysis
• R Language
• Big Data Hadoop
• Data Science

1. Core Python: Introduction to Python, Basic Syntax, Data Types, Variables, Operators,
   Input/Output, Flow of Control (Modules, Branching), If, If-else, Nested if-else, Looping,
   For, While, Nested loops, Control Structures, Break, Continue, Pass, Strings and Tuples,
   Accessing Lists, Assigning and retrieving values from Lists, Introducing Tuples, Accessing
   Tuples, Operations, Functions and Functional Programming, Declaring and calling
   Functions, Special functions in Python (lambda, map and reduce), Advanced functions in
   Python (variable-length arguments, Closures and Decorators), Namespaces, Generators
   and Iterators.
2. Advanced Python: Object Orientation, OOPs concepts, Classes and Objects, Attributes,
   Inheritance, Overloading, Overriding, Data Hiding, Meta Classes, Shared Memory
   concepts, Exception Handling, the except clause, the try-finally clause, User-Defined
   Exceptions, Debugging modules (pdb, doctest and logging). Python Libraries – NumPy,
   SciPy, pandas, scikit-learn, matplotlib, bs4, etc.
3. Probability & Statistics: Introduction to Statistics – Descriptive Statistics, Summary
   Statistics, Basic Probability Theory, Statistical Concepts (uni-variate and bi-variate
   sampling, distributions, re-sampling, statistical inference, prediction error), Probability
   Distributions (continuous and discrete – Normal, Bernoulli, Binomial, Negative Binomial,
   Geometric and Poisson), Bayes’ Theorem, the Central Limit Theorem, Data Exploration
   & Preparation, Concepts of Correlation, Regression, Covariance, Outliers, etc.
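
As a small worked illustration of Bayes’ Theorem from this unit, consider a diagnostic-test
example; the prevalence and accuracy figures below are invented for the example.

    # P(disease) = 1%, P(positive | disease) = 95%, P(positive | healthy) = 5%
    prior     <- 0.01
    sens      <- 0.95
    false_pos <- 0.05

    # Bayes' theorem: P(disease | positive) = P(pos | disease) P(disease) / P(pos)
    evidence  <- sens * prior + false_pos * (1 - prior)
    posterior <- sens * prior / evidence
    posterior   # about 0.16, despite the 95% sensitive test
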
4. R Programming: Introduction & Installation of R, R Basics, Finding Help, Code Editors
   for R, Command Packages, Manipulating and Processing Data in R, Reading and Getting
   Data into R, Exporting Data from R, Data Objects – Data Types & Data Structures,
   Viewing Named Objects, Structure of Data Items, Manipulating and Processing Data in R
   (creating, accessing and sorting data frames; extracting, combining, merging and
   reshaping data frames), Control Structures, Functions in R (numeric, character,
   statistical), Working with Objects, Viewing Objects within Objects, Constructing Data
   Objects, Building R Packages, Running and Manipulating Packages, Parametric and
   Non-parametric Tests – ANOVA, chi-Square, t-Test, U-Test, Introduction to Graphical
   Analysis, Using Plots (Box Plots, Scatter Plots, Pie Charts, Bar Charts, Line Charts),
   Plotting Variables, Designing Special Plots, Simple Linear Regression, Multiple
   Regression.
   Projects: Market Basket Analysis, Housing Price Prediction, Student Evaluation.
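
A short sketch in base R of a few of the data-frame operations listed above (creating,
sorting and merging); the sample records are invented.

    # Create two small data frames
    students <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ravi", "Meena"))
    marks    <- data.frame(id = c(3, 1, 2), score = c(88, 72, 91))

    # Sort by score (descending), then merge the frames on the shared 'id' key
    marks_sorted <- marks[order(marks$score, decreasing = TRUE), ]
    merged <- merge(students, marks, by = "id")
    print(merged)
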
5. Big Data: Introduction to Big Data – the definition of big data, enterprise/structured data,
   social/unstructured data, the need to analyze unstructured data, What is Big Data, the Big
   Deal about Big Data, Big Data sources, industries using Big Data, and Big Data challenges.

Hadoop: Introduction to Big Data programming with Hadoop, the history of Hadoop, the
ecosystem and stack, the Hadoop Distributed File System (HDFS), components of Hadoop,
the design of HDFS, Java interfaces to HDFS, architecture overview, the development
environment, Hadoop distributions and basic commands, Eclipse development, the HDFS
command line and web interfaces, the HDFS Java API (lab), analyzing data with Hadoop,
event stream processing, complex event processing, an introduction to MapReduce,
developing a MapReduce application, how MapReduce works, the anatomy of a MapReduce
job run, failures, job scheduling, shuffle and sort, task execution, MapReduce types and
formats, MapReduce features, and real-world MapReduce.
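
Production MapReduce jobs are written against the Hadoop APIs (typically in Java), but the
word-count pattern at the heart of MapReduce can be sketched in plain R to show the map,
shuffle/sort and reduce phases on an in-memory vector; this is only a conceptual stand-in,
not Hadoop code.

    # Input: two "lines" of text standing in for an HDFS file split
    lines <- c("big data big ideas", "data science and big data")

    # Map phase: split each line into words, conceptually emitting (word, 1) pairs
    words <- unlist(strsplit(lines, " "))

    # Shuffle/sort groups identical keys; the reduce phase sums each group's counts
    counts <- tapply(rep(1, length(words)), words, sum)
    print(counts)   # e.g. big = 3, data = 3, ...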

Hadoop ETL: Hadoop ETL development, the ETL process in Hadoop, discussion of ETL
functions, data extraction, the need for ETL tools, and the advantages of ETL tools.

Pig and Hive: Programming Pig: an engine for executing data flows in parallel on Hadoop;
programming with Hive: a data warehouse system for Hadoop; optimizing with combiners
and partitioners (lab); more common algorithms: sorting, indexing and searching (lab);
relational manipulation: map-side and reduce-side joins (lab); evolution, purpose and use;
HDFS – overview and concepts, data flow (read and write), interfaces to HDFS (HTTP, CLI
and Java API), high availability and NameNode federation; MapReduce – developing and
deploying programs, optimization techniques, MapReduce anatomy, data-flow framework
programming, MapReduce best practices and debugging; introduction to the Hadoop
ecosystem; integrating R with Hadoop.

Hadoop Environment: Setting up a Hadoop Cluster, Cluster Specification, Cluster Setup
and Installation, Hadoop Configuration, Security in Hadoop, Administering Hadoop, HDFS
Monitoring & Maintenance, Hadoop Benchmarks, Hadoop in the Cloud.

CHAPTER-III
TECHNICAL DETAILS AND WORKING

TECHNOLOGY USED:

• ggplot2:
This is the backbone of this project. ggplot2 is the most popular data visualization library in
R and is widely used for creating aesthetic visualization plots.
• ggthemes:
This is more of an add-on to our main ggplot2 library. With it, we can create extra themes
and scales beyond those in the mainstream ggplot2 package.
• lubridate:
Our dataset involves various time frames. In order to understand our data in separate time
categories, we will make use of the lubridate package.
• dplyr:
This package is the lingua franca of data manipulation in R.
• tidyr:
This package will help you tidy your data. The basic principle of tidyr is that each variable
is present in a column, each observation is represented by a row, and each value is a cell.
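
To show how these packages fit together, here is a minimal sketch along the lines of the
Uber-trips analysis in this project. The file name uber_trips.csv and the Date.Time column
are assumptions for illustration; the actual project works on the Uber pickups dataset.

    library(ggplot2)
    library(dplyr)
    library(lubridate)

    # Load trip records (hypothetical file and column names)
    trips <- read.csv("uber_trips.csv")

    # lubridate: parse timestamps and extract the hour of each pickup
    trips$Date.Time <- mdy_hms(trips$Date.Time)
    trips$hour <- hour(trips$Date.Time)

    # dplyr: count trips per hour of the day
    hourly <- trips %>% group_by(hour) %>% summarise(total = n())

    # ggplot2: plot the hourly trend
    ggplot(hourly, aes(x = hour, y = total)) +
      geom_col(fill = "steelblue") +
      labs(title = "Trips by Hour of Day", x = "Hour", y = "Total Trips")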

CHAPTER-IV
APPLICATIONS AND CASE STUDY

CODE:
INPUT SCREENSHOT 1:

INPUT SCREENSHOT 2:

INPUT SCREENSHOT 3:

INPUT SCREENSHOT 4:

INPUT SCREENSHOT 5:

OUTPUT SCREENSHOT 1:

OUTPUT SCREENSHOT 2:

OUTPUTS:

CHAPTER-V

CONCLUSION

At the end of the Uber data analysis R project, we have seen how to create data
visualizations. We made use of packages like ggplot2 that allowed us to plot various types of
visualizations pertaining to customer trips across several time frames.

REFERENCES

1. Webpage: DataFlair, “R Data Science Project – Uber Data Analysis”,
   https://data-flair.training/blogs/r-data-science-project-uber-data-analysis/
2. David Beazley and Brian K. Jones, “Python Cookbook”, O’Reilly Media.
