Professional Documents
Culture Documents
Batch Vs Online ML: Wednesday, March 17, 2021 5:30 PM
Batch Vs Online ML: Wednesday, March 17, 2021 5:30 PM
Batch Vs Online ML
Wednesday, March 17, 2021 5:30 PM
1. Lots of Data
2. Hardware Limitation
3. Availability
1. Tricky to use
2. Risky
○ Build data pipelines/APIs for easy access to the ○ BIG DATA Tools (Apache Spark, Hadoop, Apache
data. Kafka, Apache Hive)
Skills
Responsibilities of a Data Analyst
• Statistical Programming
• Programming Languages
• Cleaning and organizing Raw data.
(R/SAS/Python)
• Analyzing data to derive insights.
• Creative and Analytical Thinking
• Creating data visualizations.
• Business Acumen — Medium to High
• Producing and maintaining reports.
preferred
• Collaborating with teams/colleagues based on
• Strong Communication Skills.
the insight gained.
• Data Mining, Cleaning, and Munging
• Optimizing data collection procedures
• Data Visualization
• Data Story Telling
• SQL
• Advanced Microsoft Excel
Responsibilities Skills
1. Watch time
2. Search but did not find
3. Content left in the middle
4. Clicked on recommendations(order of recommendations)
Feature engineering is the process of using domain knowledge to extract features from raw
data. These features can be used to improve the performance of machine learning algorithms.
Feature
Engineering
Feature
Transformation Feature
Feature
Feature
Selection Extraction
Construction
Missing Value
Imputation
Handling Categorical
Features
Outlier
Detection
Feature
Scaling
Pipelines chains together multiple steps so that the output of each step is used
as input to the next step.
Pipelines makes it easy to apply the same preprocessing to train and test!
Complete Case Analysis means literally analyzing only those observations for
which there is information in all of the variables in the dataset.
Advantage
1. Easy to implement as no data manipulation required
2. Preserves variable distribution (if data is MCAR, then the
distribution of the variables of the reduced dataset should
match the distribution in the original dataset
Disadvantage
1. It can exclude a large fraction of the original dataset (if
missing data is abundant)
2. Excluded observations could be informative for the
analysis (if data is not missing at random)
3. When using our models in production, the model will not
know how to handle missing data
1. Actual Dataset
Step 2 - Remove all col1 missing values Step 3 - Predict the missing values of col1 using
other cols
1. Normal Distribution
2. Skewed Distribution
3. Other Distributions
Day 43 - Outlier detection and removal using IQR method Page 170
Intro
Monday, May 3, 2021 11:49 AM
Never reaches 0
Predicted
1. Hierarchical Clustering
2. Density Based Clustering
1. Cross Platform
2. Multiple Language Support
3. Integration with other libraries and tools
4. Support all kinds of ML problems
1. Parallel Processing
2. Optimized Data Structures
3. Cache Awareness
4. Out of Core computing
5. Distributed Computing
6. GPU Support
1. Parallel Processing
2. Optimized Data Structures
3. Cache Awareness
4. Out of Core computing
5. Distributed Computing
6. GPU Support
1. Parallel Processing
2. Optimized Data Structures
3. Cache Awareness
4. Out of Core computing
5. Distributed Computing
6. GPU Support
Step 1 - Identify all points as either core point, border point or noise point
Step 2b - add all the points that are unclustered and density connected to the
current point into this cluster
Step 3 - For each unclustered border point assign it to the cluster of nearest core
point
1. Robust to outliers
2. No need to specify clusters
3. Can find arbitrary shaped clusters
4. Only 2 hyperparameters to tune
1. Sensitivity to hyperparameters
2. Difficulty with varying density clusters
3. Does not predict
Finance
Healthcare
Manufacturing
Consumer Internet
Environment Science
Advantages
1. Reduction in bias
2. Faster training
Disadvantages
Advantage
1. Reduced Bias
Disadvantage
1. Increased size
2. Duplication of data may cause overfitting
Disadvantages
2. Computational Complexity
4. Sensitive to Outliers