4/14/2014 ActivSteps

http://activsteps.com/PraticalDataScience.html 1/11
PRACTICAL DATA SCIENCE
Module 1: Data Science Essentials
Unit 1 - Introduction to Data Science
What is Data Science?
Disciplines that make up Data Science
What does a data scientist do with the data?
Pre-requisites and Resources (Statistics, Mathematics, Computer Science)
Business Modeling
Why should we build models or use data to run a business?
Tribal knowledge/Intuition vs. Evidence
What kind of models do data scientists build? Limitations of Models
What problems need a prediction?
How do you evaluate the accuracy of predictions?
Understanding Data
Understanding Data Types
Data Preprocessing and Transformation
Approximations and estimations
Analyzing networks and graphs
Representing data graphically and analytically
Data Science Applications
Churn Analysis
Data Preprocessing and Transformation
Recommendations
Pattern recognition and learning algorithms
Unit 2 - Reviving with R (Tutorials)
Basic Data Types
Vector
Matrix
List
Data Frame
Data Import and export
Control Structures
Some important R packages and functions (plyr, apply)
Unit 3 - Python for Data Analysis
IPython IDE
NumPy Basics
Pandas Basics
Data handling (Loading, Storage and file formatting)
Data Wrangling (Clean, Transform, Merging)
Handling missing values
Binning, Classing and Standardization
Outlier/Noise
Type Conversion
Unit 4 - Data Analysis
Basic Exploratory Data Analysis
Data exploration (histograms, bar chart, box plot, line graph, scatter plot)
Qualitative and Quantitative Data
Central Tendencies : Mean, Median, Mode
Dispersion : Range, Variance, Standard Deviation
Anscombe's quartet
Other Measures : Quartile and Percentile, Interquartile Range, Skew and Kurtosis
Relationship between attributes : Covariance, Correlation Coefficient, Chi-Square
Moment Generating Functions (Random Data)
Principal Component Analysis
Unit 5 - Data Visualization
The science and the art
Science of Visualization
Visualization Periodic Table
Aesthetics and Story telling
Data Visualization using d3.js, Google AppEngine & Charts and ggplot2
Bubble charts
Gauge charts
Tree map
Heat map
Motion charts
Force Directed Charts etc.
Unit 6 - Processing Unstructured Data
Text Pre-processing
Regular Expressions
Sentence Splitting and Tokenization
Find Unique words and count
Punctuation and Stopwords
Incorrect spellings
Basic Natural Language Processing
Properties of words
Lemmatization and Term-Document TxD computation
Bag-of-words
Similarity measures (Cosine Similarity, Chi-Square, N Grams)
Part-of-Speech Tagging
Stemming
Chunking
Module 2: Computing at Scale
Unit 7 - Processing Big Data
Essential Hadoop
Distributed Computing for Scale and Price/Performance
HDFS Overview and Architecture
Functional Programming model
Evolution and overview of MapReduce
MapReduce Data flow
Working with Hadoop
Different types of Installations.
Demo VM Image setup
Linux and HDFS
Unit 8 - MapReduce (MR) Programming
Core MR Programming
Hadoop Data Types
Basic MapReduce API Concepts
Input Splits, Shuffling, Sorting, Combining
Custom Writable & WritableComparables
Combiners & Partitioners
Streaming (in Python)
Word Co-occurrence and N-grams
Inverted Index
TF-IDF
Page Rank
Unit 9 - MR Algorithms for Data Scientists
Map Reduce Applications
Graph Processing
Sample ML Algorithm
Pandas Basics
Hadoop ecosystem
Sqoop
Flume
Mahout
Unit 10 - Pig : Dataflow Language
Introduction to Pig
Pig Data Model
Input and Output
Relational Operations
User Defined Functions
Pig (Tutorial)
Unit 11 - Hive : Data Warehouse Framework
Introduction to Hive
Hive Architecture
Data Definition and Manipulation
Data Model
Data Handling and Modeling
Pig and Hive Comparison
Hive (Tutorial)
Unit 12 - NoSQL Concepts
NoSQL Introduction
ACID vs. BASE
CAP Theorem
NoSQL DBs (Key-value, Columnar, Document, Graph)
NoSQL Modeling
Relational Schema to Key Value and Document Stores
Relational Schema to Graph Stores
Designing a NoSQL Database for Twitter
Building an Application with NoSQL
Designing a Social Media application
Exploring the design of Twitter.com
Unit 13 - HBase and Neo4j
HBase Overview
HBase Concepts, Architecture
Data Model
HBase Commands
Neo4j Overview
Concepts and Data Modeling
Cypher Query language
Graph Search and applications
Unit 14 - Cassandra and MongoDB
Cassandra Overview
Cassandra Concepts, Architecture
The Cassandra Data Model
Introduction to Clusters
Gossip and Failure Detection
Compaction, Bloom Filters, Tombstones
Reading and Writing Data
Cassandra Commands
MongoDB Overview
Concepts and Architecture
Schema Design
Data Manipulation - CRUD Operations
Aggregations
Module 3: Predictive Analytics
Unit 15 - Statistical Thinking
Probability Concepts
Statistical Distributions
Normal Distribution (when data is a continuous numeric variable)
Binomial Distribution (when response data is binary)
Poisson Distribution (when data is count-based)
Exponential Distribution (useful for survival analysis kind of data)
Central Limit Theorem
Analysis of Variance (ANOVA)
Bayesian Statistics
Bayesian analysis
Prior probability (Naive Density Estimator)
Conditional probability (Joint Density Estimator)
Comparing two proportions
Posterior probability
Bayes Theorem
Useful Statistical Inferences about Business Outcomes
Concepts of Hypothesis Testing
Testing for equality of variances of two samples
Comparing the equality of means of two samples
Comparing two proportions
Correlation between two samples
Tests on a two-variable contingency table
Unit 16 - Statistical Analysis
Introduction to Regression
Regression (Linear, Multivariate Regression) in forecasting
Analyzing and interpreting regression results
Multi-collinearity
Logistic Regression
Forecasting
Trend analysis and Time Series
Cyclical and Seasonal analysis
Box-Jenkins method
Smoothing and Moving averages
Auto-correlation
ARIMA - Holt-Winters method
Sales Prediction
Time Series Decomposition of Cement Sales by Quarter
Predicting the sales for the next four quarters
Module 4: Text Mining and Machine Learning
Unit 17 - Applications of Text Analysis
Fundamentals of Information Retrieval
Data Collection and Structuring
Tools and Techniques for Data Collection from Facebook, Twitter, etc.
Data Storage Options, Standardization and Preparation for Analysis
Text classification and feature selection:
How to use Naive Bayes classifier for text classification
Evaluation systems on the accuracy of text mining
Locality-Sensitive Hashing
Applications
An introduction to text mining for sentiment analysis.
Text Analysis (Tutorial)
Twitter and Email Analysis
Social Graphs and Segmentation
Spam Filtering or Text Classification
Shakespeare Text Analysis
Unit 18 - Social Media Analysis
Data Collection and Structuring
Tools and Techniques for Data Collection from Facebook, Twitter, etc.
Data Storage Options, Standardization and Preparation for Analysis
Mining Social Media
Topic Mining and Trending
Social Media Analytics: Clustering, Regression, etc.
Twitter Analysis with Mahout
Social Graphs (Neighbor analysis and Community Detection)
Unit 19 - Introduction to Machine Learning
Data Mining and Machine Learning
Decision Trees
Recommender Systems
User based
Item Based
Singular value decomposition-based recommenders
Text classification and feature selection:
How to use Naive Bayes classifier for text classification
Evaluation systems on the accuracy of text mining
Locality-Sensitive Hashing
Similarity Measures :
Pearson correlation
Spearman correlation
Euclidean distance
Cosine measure
Tanimoto coefficient
Log-likelihood test
Cases
Decision Tree Classifier (Iris Data Set)
Movie Ratings and recommendations
Unit 20 - Predictive Modeling
Pre-processing for Analytics
Creating standard data sets - Training, Testing and Validation
Data Sampling and methods (Data reduction, Modeling, Balancing, Over/Under-sampling)
Feature selection (Feature creation, bundling, ranking)
Analyzing the goodness of Models
Structure and anatomy of models
Characteristics of good models
Concepts of under-fitting and over-fitting of models
Evaluating Performance of a Model
Likelihood ratio
Scoring and Bagging
Performance Measures :
Confusion matrix
ROC curve
Lift
KS test
Unit 21 - Supervised Learning
Classification
Naive Bayes classifier
Bayesian belief networks
Unit 22 - Unsupervised Learning
Clustering
Similarity measures for grouping objects
Connectivity models (hierarchical clustering)
Partition Clustering
Analyzing clustering results
Using Clustering for Prediction
Clustering Techniques
K-Nearest Neighbor method
Wilson editing and triangulations
K-nearest neighbors in collaborative filtering, digit recognition
Unit 23 - Other Techniques
Linear learning machines
Prediction using Linear Regression
Parameter Estimation in Regression
Analysis of Regression Results
Use of Regression in Analysis of Variance
Support Vector Machines (SVM)
Ensemble Techniques
Ensemble and Hybrid models
AdaBoost, Random Forests and Gradient boosting machines
Neural Networks and their applications
Perceptron and Single Layer Neural Network, and hand calculations
Back propagation and conjugate gradient techniques
Applications : Face and Digit Recognition
Face Recognition with SVD, Eigenvectors
Unit 24 - Tackling a Data Science Project
Data Science Development Framework
Cases : Community Detection, Recommender Engine for a Job Portal, Ad Serving Platform
Analyzing the case study
Developing a Solution Architecture
Developing a Technical Architecture
Developing a plan for the Tools and Algorithms
Mentors
ReddyRaja Annareddy
Founder and CEO, Akrantha Software; Faculty IIIT-Hyd
Surya Putchala
Founder and CEO, Zettamine
Pavan Kumar Penjandra
Big Data Engineer
Karthik Reddy
Data Scientist
2013 ACTIVSTEPS INC. ALL RIGHTS RESERVED.