A Complete 52 Week Course To Become A Data Scientist in 2021

Learn something every week for 52 weeks!
Terence Shin Dec 23, 2020 · 10 min read
Photo by Ivan Diaz on Unsplash
“Everyone wants to eat, but few are willing to hunt”
Be sure to subscribe here or to my exclusive newsletter to never miss
another article on data science guides, tricks and tips, life lessons,
and more!
Introduction
If you want to be a data scientist but haven’t yet made the commitment,
now is the time.
Last year, I made a commitment to learn something new about data
science every week for 52 weeks, and I think that was one of the best
decisions I’ve ever made. You’d be surprised at how far you can get in a
year.
And so, I’m presenting to you a complete 52-week curriculum that you
can do in 2021 as a New Year’s resolution! It is crammed and it will be a
grind, but it will be worth it.
Immediately, you’ll notice that this guide does not start with machine
learning, and I have good reasons for that. If you want to learn more
about why machine learning comes in the later weeks, check out my
article:
Want to Be a Data Scientist? Don’t Start With Machine Learning.

The biggest misconception aspiring data scientists have
towardsdatascience.com
A couple of notes before we dive into it:
This will not cover everything that you need to know to be a fully
equipped data scientist. That being said, this will cover what I believe are
the most fundamental skills of a data scientist.
This assumes that you already know differential calculus since we all
learned it in high school.
This curriculum will not include anything related to deep learning. Deep
learning deserves 52 weeks of its own; it would be a disservice if
I tried to squeeze it in!
With that said, let’s dive into it!

Course Structure
1. Statistics and Probability (Week 1 to Week 6)
2. Mathematics (Week 7 to Week 12)
3. SQL (Week 13 to Week 21)
4. Python and Programming (Week 22 to Week 28)
5. Pandas (Week 29 to Week 33)
6. Visualizing Data (Week 34 to Week 35)
7. Data Exploration and Preparation (Week 36 to Week 39)
8. Machine Learning (Week 40 to Week 51)
9. Data Science Project (Week 52)
Statistics & Probability

Why Statistics and Probability?
Data science and machine learning are essentially a modern version of
statistics. By learning statistics first, you’ll have a much easier time when
it comes to learning machine learning concepts and algorithms! Even
though it may seem like you’re not getting anything tangible out of the
first few weeks, it will be worth it in the later weeks.
Week 1: Descriptive Statistics

Intro to Descriptive Statistics
Another Intro to Descriptive Statistics
Week 2: Probability
Theoretical probability
Sample spaces
Set operations
Addition rule
Multiplication rule for independent events
Multiplication rule for dependent events
Conditional probability and independence
Week 3: Combinations and Permutations

Counting principle and factorial
Permutations
Combinations
Combinatorics
Week 4: Normal Distribution and Sampling Distributions

Normal distribution and the Empirical rule
Introduction to Sampling Distributions
Sampling distribution of a sample proportion
Sampling distribution of a sample mean
Week 5: Confidence Intervals

Introduction to Confidence Intervals
Estimating Sample Proportions
Estimating Sample Means
Week 6: Hypothesis Testing

Introduction to Hypothesis Testing
Error probabilities and power
Tests about a population proportion
Tests about a population mean
More videos
Mathematics
Why Mathematics?
Like statistics, many data science concepts build on fundamental
mathematical concepts.
In order to understand cost functions, you need to know differential
calculus. In order to understand hypothesis testing, you need to
understand integration. And to give one more example, linear
algebra is essential to learning deep learning concepts, recommendation
systems, and principal component analysis.
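To make the cost function point a bit more concrete, here is a minimal sketch of my own (it is not taken from any of the linked resources). It assumes a one-parameter model y = w * x and uses the derivative of the mean squared error to nudge w toward a better value. The data, learning rate, and function names are all made up for illustration.

```python
# Illustrative only: fit y = w * x by minimizing mean squared error (MSE).
# The data, learning rate, and step count below are made-up assumptions.

def mse(w, xs, ys):
    """Cost function: average squared gap between predictions and targets."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the MSE with respect to w (the calculus part)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def fit_slope(xs, ys, learning_rate=0.05, steps=200):
    """Gradient descent: repeatedly step against the derivative."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

w = fit_slope(xs, ys)
print("fitted slope:", round(w, 3))
print("final cost:", round(mse(w, xs, ys), 6))
```

The derivative tells you which direction lowers the cost, which is why differential calculus shows up again once you reach the machine learning weeks.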
Week 7: Vectors and Spaces

Vectors
Linear Combinations and Spans
Linear Dependence and Independence
Subspaces and the basis for a subspace
Week 8: Dot Product and Matrix Transformations pt. 1

Vector dot and cross products
Functions and Linear Transformations
Transformations and Matrix Multiplications
Week 9: Matrix Transformations pt. 2

Inverse Functions and Transformations
Inverses and Determinants
Transpose of a Matrix
Week 10: Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors
Anything that you couldn’t finish in the past few weeks!
Week 11: Integrals

Approximation with Riemann Sums
Definite Integrals with Riemann Sums
The Fundamental Theorem of Calculus and Accumulation Functions
Properties of Definite Integrals
Week 12: Integrals Part 2!

The Fundamental Theorem of Calculus and Definite Integrals
Reverse Power Rule
Indefinite Integrals of Common Functions
Definite Integrals of Common Functions
SQL
Why SQL?
SQL is arguably the most important skill to learn across any type of
data-related profession, whether you’re a data scientist, data engineer,
data analyst, or business analyst.
At its core, SQL is used to extract (or query) specific data from a
database, so that you can do things like analyze the data, visualize the
data, model the data, etc. Therefore, developing strong SQL skills will
allow you to take your analyses, visualizations, and modeling to the next
level because you will be able to extract and manipulate the data in
advanced ways.
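As a rough sketch of what “extracting specific data” looks like in practice, the snippet below runs two queries against a tiny in-memory database. It uses Python’s built-in sqlite3 module only so that it runs anywhere; the orders table and its rows are hypothetical, and the SQL dialect in your own database may differ slightly.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ana", 20.0), (2, "ben", 75.0), (3, "ana", 120.0), (4, "cy", 5.0)],
)

# Extract only the rows we care about, largest amount first.
big_orders = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 50 ORDER BY amount DESC"
).fetchall()

# Aggregate: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

conn.close()

print(big_orders)
print(totals)
```

The first query filters and sorts, and the second aggregates. These are the same building blocks (SELECT, WHERE, ORDER BY, GROUP BY) covered in Weeks 13 to 15.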
I came across Mode’s curriculum a while back and it is fantastic! So I
would first get familiar with using SQL in Mode and then you’ll be able to
go through the topics below!
Week 13: Basic SQL

Introduction to SQL
SELECT FROM statement
WHERE statement
ORDER BY statement
LIMIT statement
Week 14: LOGICAL and COMPARISON Operators

Logical Operators
Comparison Operators
Week 15: AGGREGATES

Aggregate Functions (COUNT, SUM, MIN/MAX, AVG)
GROUP BY clause
HAVING clause
Week 16: DISTINCT, CASE WHEN

DISTINCT
CASE WHEN
Week 17: JOINS and UNIONS

JOINs
UNIONs
Week 18: Subqueries and Common Table Expressions

Subqueries
Common Table Expressions (CTEs)
Week 19: String Manipulations

String Functions in SQL (LEFT/RIGHT, TRIM, STRPOS, SUBSTR,
CONCAT, UPPER/LOWER, etc…)
Week 20: Date-time manipulation

EXTRACT
DATE_ADD()
DATE_SUB()
DATE_DIFF()
See here for more functions (on the left of the webpage)
Week 21: Window Functions

Window Functions (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG,
LEAD, SUM, COUNT, AVG)
See here for advanced window functions.
Python and Programming

Why Python?
I started with Python, and I’ll probably stick with Python for the rest of
my life. It’s so far ahead in terms of open source contributions, and it’s
straightforward to learn. Feel free to go with R if you want, but I have no
opinions or advice to provide regarding R.
Week 22: Introduction to Python

Video: (0:00 to 1:03:10)
Week 23: Lists, Tuples, Functions, Conditional Statements, Comparisons

Video: (1:03:10 to 2:00:37)
Week 24: Dictionaries, Loops, Comments

Video: (2:00:37 to 3:04:17)
Week 25: Try/Except, Reading & Writing files, Classes and Objects
Video: (3:04:17 to 4:20:43)
Week 26: Recursion

Video explanation of recursion
Course on recursion
Week 27: Binary Trees

Course on Binary Trees
Week 28: APIs and Anaconda

APIs for beginners
Getting Anaconda setup on your computer
Pandas
Why Pandas?
Arguably the most important library to know in Python is Pandas, which
is specifically meant for data manipulation and analysis.
Week 29: Getting and Knowing your data

Follow along and learn — YouTube video
Practice problem set #1
Week 30: Filtering and Sorting

Week 31: Grouping

Week 32: Apply

Week 33: Merge

Visualizing Data
Why Data Visualizations?
The ability to visualize data and insights is so important because it’s the
easiest way to communicate intricate information and a lot of
information at once. As a data scientist, you’re always selling yourself
and your ideas, whether you’re pitching a new project or convincing others
why your model should be put into production. Data visualizations are a
great tool to help you with that.
There are dozens of data visualization libraries out there, but I’m going
to focus on two: Matplotlib and Plotly.
Week 34: Data Visualizations with Matplotlib

Introduction to Matplotlib
3-D Visualizations in Matplotlib
Types of Data Visualizations in Matplotlib
Cheatsheet
Week 35: Data Visualizations with Plotly

Types of Visualizations in Plotly (beginner)
Types of Visualizations in Plotly (beginner and advanced)
Data Exploration and Preparation

Why Data Exploration and Preparation?
“Garbage in, garbage out”
The models that you create can only be as good as the data that you feed
into them. To understand the state of your data, i.e. whether it’s
“good” or not, you have to explore the data and prepare the data.
Therefore, for the next four weeks, I’m going to provide several amazing
resources for you to go through and get a better understanding of what
data exploration and preparation entails.
Week 36: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) can be difficult because there’s no one
set way of doing it — but that’s also what keeps it exciting. Generally, you
want to…
Derive descriptive statistics (e.g. central tendency)
Perform univariate analysis (distributions and spread)
Perform multivariate analysis (scatterplots, correlation matrix,
predictive power score, etc.)
Check for missing data and check for outliers
Check out a beginner’s guide to EDA here.
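To show what a first pass over those steps might look like, here is a minimal sketch using Pandas. The DataFrame and its column names are made up, and the 1.5 * IQR cutoff for outliers is just one common rule of thumb, not the only way to do it.

```python
import pandas as pd

# Made-up data for illustration; the column names are hypothetical.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 33, 35, 90],
    "income": [30.0, 35.0, None, 48.0, 50.0, 55.0, 60.0, 64.0],
})

summary = df.describe()      # descriptive statistics for each column
missing = df.isna().sum()    # number of missing values in each column
correlation = df.corr()      # pairwise correlation matrix

# Flag outliers in "age" with the 1.5 * IQR rule of thumb.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(missing)
print(correlation)
print(outliers)
```

From here you would normally plot the distributions and scatterplots as well, which is where the Matplotlib and Plotly weeks come in.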
Week 37: Data Preparation: Feature Imputation and Normalization

What is Feature Imputation?
6 ways to impute missing data
Normalization vs Standardization
Example of implementing normalization vs standardization
Week 38: Feature Engineering and Feature Selection

Feature Engineering Mini Course
Feature Encoding
An Introduction to Feature Selection
Week 39: Imbalanced Datasets

An Introduction to Imbalanced Classification Problems
The Right Way to Oversample in Predictive Modeling
Machine Learning
Why Machine Learning?
Everything that you’ve learned has led up to this point! Not only is
machine learning interesting and exciting, but it is also a skill that all data
scientists have. It’s true that modeling makes up a small portion of a data
scientist’s time, but it doesn’t take away from its importance.
Later in your career, you might notice that I left out some machine
learning algorithms, like K Nearest-Neighbors, Gradient Boost, and
CatBoost. This is completely intentional — if you can understand the
following machine learning concepts, you’ll have the skills to learn any
other machine learning algorithms in the future.
Week 40: Introduction to Machine Learning

Supervised vs Unsupervised, Continuous vs Discrete
Bias-variance tradeoff
Week 41: Linear Regression

Linear Models: Linear Regression
Linear Models: Multiple Regression
Mathematics behind linear regression
Week 42: Logistic Regression

Introduction to Logistic Regression
Part 1: Coefficients
Part 2: Maximum likelihood
Part 3: R-squared and P-value
Week 43: Regularization

Ridge Regression (L2)
Lasso Regression (L1)
Elastic Net Regression
Week 44: Decision Trees

Decision Trees Introduction
Feature Selection and Missing Data
Implementing a Decision Tree in Python
Week 45: Naïve Bayes

A Mathematical Explanation of Naïve Bayes
Naïve Bayes (StatQuest)
Week 46: Support Vector Machines

Intuition of Support Vector Machines
Support Vector Machines in Python
A mathematical explanation of Support Vector Machines
Week 47: Clustering

K-means clustering
Hierarchical clustering
Week 48: Principal Component Analysis

Principal Component Analysis (PCA) step-by-step
Another detailed explanation by Luis Serrano (I highly suggest you
watch both)
Mathematical explanation of PCA
Week 49: Bootstrap Sampling, Bagging, and Boosting

Bootstrap Sampling
Ensemble learning, Bagging, Boosting
Week 50: Random Forests and Boosted Trees

Random Forests pt.1
Random Forests pt.2
XGBoost — Regression
XGBoost — Classification
XGBoost — Mathematical Details
XGBoost in Python
Week 51: Model Evaluation Metrics

Evaluation Metrics with Python Code
Understanding the confusion matrix and how to implement it in
Python
Week 52: Data Science Project

If you feel comfortable with the materials above, you’re definitely ready
to start your own data science project! Just in case, I’ve provided three
ideas that you can use as inspiration to get started, but feel free to do
whatever you like.
Idea 1: SQL Case Study

Link to the case.
The objective of this case is to determine the cause for a drop in user
engagement for a social network called Yammer. Before diving into the
data, you should read the overview of what Yammer does here. There are
4 tables that you should work with.
The link to the case above will provide you with much more detail
pertaining to the problem, the data, and the questions that should be
answered.
Check out how I approached this case study here if you’d like guidance.
Skills You’ll Develop
SQL
Data Analysis
Data Visualization if you choose to visualize your insights.
Idea 2: Trustpilot Webscraper

Web scraping is simple to learn and extremely useful,
especially when it comes to collecting data for personal projects.
Scraping a customer review website, like Trustpilot, is valuable for a
company as it allows them to understand review trends (getting better or
worse) and see what customers are saying via NLP.
First I would get familiar with how Trustpilot is organized, and decide
upon which kinds of businesses to analyze. Then I would take a look at
this walkthrough of how to scrape Trustpilot reviews.
Skills You’ll Develop
Writing Python Scripts
Data Wrangling
BeautifulSoup/Selenium (webscraping libraries)
Data Analysis
Take it further and apply NLP to extract insights from reviews.
Idea 3: Titanic Machine Learning Competition

In my opinion, there’s no better way of showing that you’re ready for a
data science job than to showcase your code through competitions.
Kaggle hosts a variety of competitions that involve building a model to
optimize a certain metric, one of them being the Titanic Machine
Learning Competition.
If you want to get some inspiration and guidance, check out this step-by-
step walkthrough of one of the solutions.
Skills You’ll Develop
Data Exploration and Cleaning with Pandas
Feature Engineering
Machine Learning Modelling
Thanks for Reading!

I hope you found this useful! If you managed to get through this, you
should have a strong understanding of the fundamentals in Statistics,
Mathematics, SQL, Python/Pandas, and several machine learning
algorithms!
I hope this inspired you to continue learning too. There are so many
things that you can continue to explore, like more advanced models (e.g.
CatBoost), deep learning, experimental design, Bayesian modeling, cloud
architecture, and the list goes on.
If you like this and want to see future content, be sure to give me a follow
on Medium. And as always, I wish you the best in your data science
endeavors.
Not sure what to read next? I’ve picked another article for you:
Want to Be a Data Scientist? Don’t Start With Machine Learning.

The biggest misconception aspiring data scientists have
and another!
12 Data Science Projects for 12 Days of Christmas

Relevant and valuable data science projects that you can do in a day!
Terence Shin
If you enjoyed this, follow me on Medium for more
Sign up for my email list here!
Let’s connect on LinkedIn
Interested in collaborating? Check out my website