A Complete 52 Week Course to Become a Data Scientist in 2021
Learn something every week for 52 weeks!

Terence Shin Dec 23, 2020 · 10 min read

Photo by Ivan Diaz on Unsplash

“Everyone wants to eat, but few are willing to hunt”

Be sure to subscribe here or to my exclusive newsletter to never miss
another article on data science guides, tricks and tips, life lessons,
and more!

Introduction
If you want to be a data scientist but haven’t yet made the commitment,
now is the time.

Last year, I made a commitment to learn something new about data
science every week for 52 weeks, and I think that was one of the best
decisions I’ve ever made. You’d be surprised at how far you can get in a
year.

And so, I’m presenting to you a complete 52-week curriculum that you
can do in 2021 as a New Year’s resolution! It is crammed and it will be a
grind, but it will be worth it.

Immediately, you’ll notice that this guide does not start with machine
learning, and I have good reasons for that. If you want to learn more
about why machine learning comes in the later weeks, check out my
article:

Want to Be a Data Scientist? Don’t Start With Machine Learning.


The biggest misconception aspiring data scientists have
towardsdatascience.com

A couple of notes before we dive into it:

This will not cover everything that you need to know to be a fully
equipped data scientist. That being said, this will cover what I believe are
the most fundamental skills of a data scientist.

This assumes that you already know differential calculus since we all
learned it in high school.

This curriculum will not include anything related to deep learning. Deep
learning deserves 52 weeks of its own; it would be a disservice if
I tried to squeeze it in!

With that said, let’s dive into it!

Course Structure
1. Statistics and Probability (Week 1 to Week 6)

2. Mathematics (Week 7 to Week 12)

3. SQL (Week 13 to Week 21)

4. Python and Programming (Week 22 to Week 28)

5. Pandas (Week 29 to Week 33)

6. Visualizing Data (Week 34 to Week 35)

7. Data Exploration and Preparation (Week 36 to Week 39)

8. Machine Learning (Week 40 to Week 51)

9. Data Science Project (Week 52)

Statistics & Probability


Why Statistics and Probability?
Data science and machine learning are essentially a modern version of
statistics. By learning statistics first, you’ll have a much easier time when
it comes to learning machine learning concepts and algorithms! Even
though it may seem like you’re not getting anything tangible out of the
first few weeks, it will be worth it in the later weeks.

Week 1: Descriptive Statistics


Intro to Descriptive Statistics

Another Intro to Descriptive Statistics

Week 2: Probability
Theoretical probability

Sample spaces

Set operations

Addition rule

Multiplication rule for independent events

Multiplication rule for dependent events

Conditional probability and independence
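
To make these rules concrete, here is a minimal sketch that checks the addition rule and conditional probability by simply counting outcomes. The two-dice setup, the events, and the helper `prob` are my own illustrative assumptions, not something taken from the lessons above.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Theoretical probability = favourable outcomes / total outcomes."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

def is_double(o):
    return o[0] == o[1]

def sum_is_8(o):
    return o[0] + o[1] == 8

p_a = prob(is_double)                                     # 6/36 = 1/6
p_b = prob(sum_is_8)                                      # 5/36
p_a_and_b = prob(lambda o: is_double(o) and sum_is_8(o))  # only (4, 4): 1/36
p_a_or_b = prob(lambda o: is_double(o) or sum_is_8(o))    # 10/36 = 5/18

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_a_or_b == p_a + p_b - p_a_and_b

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b                             # 1/5

# A and B are independent only if P(A | B) equals P(A).
independent = p_a_given_b == p_a                          # False here
```

Using `Fraction` keeps every probability exact, so the rules can be compared with `==` instead of worrying about floating-point rounding.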

Week 3: Combinations and Permutations


Counting principle and factorial

Permutations

Combinations

Combinatorics
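
If you want to sanity-check your answers for this week, Python's standard `math` module can do the counting for you. This is just a small sketch with numbers I picked myself (5 items, choosing 3), and it assumes Python 3.8 or later for `math.perm` and `math.comb`.

```python
import math

n, k = 5, 3  # illustrative values: 5 distinct items, choose 3

arrangements = math.factorial(n)  # ways to order all 5 items: 120
permutations = math.perm(n, k)    # ordered selections of 3 out of 5: 60
combinations = math.comb(n, k)    # unordered selections of 3 out of 5: 10

# Each combination of k items can be ordered in k! ways,
# so permutations = combinations * k!
assert permutations == combinations * math.factorial(k)
```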

Week 4: Normal Distribution and Sampling Distributions


Normal distribution and the Empirical rule

Introduction to Sampling Distributions

Sampling distribution of a sample proportion

Sampling distribution of a sample mean
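
Here is a short sketch of the Empirical rule and the standard error of a sample mean, using only the standard library. The population values (`sigma = 15`, `n = 25`) are assumptions I made up for illustration.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Empirical rule: roughly 68%, 95%, and 99.7% within 1, 2, and 3 standard deviations.
coverage = {k: round(within(k), 4) for k in (1, 2, 3)}
# {1: 0.6827, 2: 0.9545, 3: 0.9973}

# Sampling distribution of a sample mean: its standard deviation
# (the standard error) shrinks with the square root of the sample size.
sigma, n = 15, 25                      # hypothetical population sd and sample size
standard_error = sigma / math.sqrt(n)  # 3.0
```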

Week 5: Confidence Intervals


Introduction to Confidence Intervals

Estimating Sample Proportions

Estimating Sample Means
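
As a rough sketch of what a confidence interval calculation looks like, the snippet below builds a normal-approximation (Wald) interval for a proportion. The function name `proportion_ci` and the survey numbers are hypothetical, and other interval methods exist that behave better for small samples.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 540 "yes" answers out of 1000 respondents.
low, high = proportion_ci(successes=540, n=1000)
print(f"95% CI: ({low:.3f}, {high:.3f})")  # 95% CI: (0.509, 0.571)
```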

Week 6: Hypothesis Testing


Introduction to Hypothesis Testing

Error probabilities and power

Tests about a population proportion

Tests about a population mean

More videos

Mathematics
Why Mathematics?
Like statistics, many data science concepts build on fundamental
mathematical concepts.

In order to understand cost functions, you need to know differential
calculus. In order to understand hypothesis testing, you need to
understand integration. And to give one more example, linear
algebra is essential to learning deep learning concepts, recommendation
systems, and principal component analysis.
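
To show why derivatives matter for cost functions, here is a minimal sketch of gradient descent on a mean squared error cost for a one-parameter model `y = w * x`. The toy data, the learning rate, and the number of steps are all assumptions I chose so the example converges quickly.

```python
# Toy data generated from y = 2x, so the best weight is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def cost(w):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cost_gradient(w):
    """Derivative of the cost with respect to w (this is the calculus part)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # starting guess
learning_rate = 0.05  # step size, chosen by hand for this toy problem
for _ in range(100):
    # Step against the gradient to move downhill on the cost surface.
    w -= learning_rate * cost_gradient(w)

print(round(w, 4), round(cost(w), 8))  # 2.0 0.0
```

The derivative tells you which direction lowers the cost, which is the main reason calculus shows up before the machine learning weeks.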

Week 7: Vectors and Spaces


Vectors

Linear Combinations and Spans

Linear Dependence and Independence

Subspaces and the basis for a subspace

Week 8: Dot Product and Matrix Transformations pt. 1


Vector dot and cross products

Functions and Linear Transformations

Transformations and Matrix Multiplications

Week 9: Matrix Transformations pt. 2


Inverse Functions and Transformations

Inverses and Determinants

Transpose of a Matrix

Week 10: Eigenvalues and Eigenvectors


Eigenvalues and Eigenvectors

Anything that you couldn’t finish in the past few weeks!

Week 11: Integrals


Approximation with Riemann Sums

Definite Integrals with Riemann Sums

The Fundamental Theorem of Calculus and Accumulation Functions

Properties of Definite Integrals

Week 12: Integrals Part 2!


The Fundamental Theorem of Calculus and Definite Integrals

Reverse Power Rule

Indefinite Integrals of Common Functions

Definite Integrals of Common Functions

SQL
Why SQL?
SQL is arguably the most important skill to learn across any type of
data-related profession, whether you’re a data scientist, data engineer,
data analyst, or business analyst, and the list goes on.

At its core, SQL is used to extract (or query) specific data from a
database, so that you can do things like analyze the data, visualize the
data, model the data, etc. Therefore, developing strong SQL skills will
allow you to take your analyses, visualizations, and modeling to the next
level because you will be able to extract and manipulate the data in
advanced ways.

I came across Mode’s curriculum a while back and it is fantastic! So I
would first get familiar with using SQL in Mode and then you’ll be able to
go through the topics below!

Week 13: Basic SQL


Introduction to SQL

SELECT FROM statement

WHERE statement

ORDER BY statement

LIMIT statement

Week 14: LOGICAL and COMPARISON Operators


Logical Operators

Comparison Operators

Week 15: AGGREGATES


Aggregate Functions (COUNT, SUM, MIN/MAX, AVG)

GROUP BY clause

HAVING clause
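
If you want to practice the clauses from Weeks 13 and 15 without setting up a database, here is a minimal sketch that runs a query against an in-memory SQLite database from Python. The `orders` table and its rows are made up for illustration, and SQLite's dialect may differ slightly from the one used in Mode.

```python
import sqlite3

# In-memory database with a small, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 5.0),
     ("cy", 50.0), ("cy", 10.0), ("cy", 15.0)],
)

query = """
SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
FROM orders
WHERE amount >= 10          -- filter rows before grouping
GROUP BY customer           -- one output row per customer
HAVING SUM(amount) > 50     -- filter groups after aggregating
ORDER BY total DESC
LIMIT 5
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('cy', 3, 75.0), ('ana', 2, 55.0)]
```

Note the difference between `WHERE` and `HAVING`: the first removes individual rows (ben's 5.0 order), the second removes whole groups after the aggregates are computed.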

Week 16: DISTINCT, CASE WHEN


DISTINCT

CASE WHEN

Week 17: JOINS and UNIONS


JOINs

UNIONs

Week 18: Subqueries and Common Table Expressions


Subqueries

Common Table Expressions (CTEs)

Week 19: String Manipulations


String Functions in SQL (LEFT/RIGHT, TRIM, STRPOS, SUBSTR,
CONCAT, UPPER/LOWER, etc…)

Week 20: Date-time manipulation


EXTRACT

DATE_ADD()

DATE_SUB()

DATE_DIFF()

See here for more functions (on the left of the webpage)

Week 21: Window Functions

Window Functions (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG,
LEAD, SUM, COUNT, AVG)

See here for advanced window functions.
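
Here is a small sketch of a few of these functions, again using SQLite from Python. The `sales` table is hypothetical, and the example assumes your Python ships with SQLite 3.25 or newer, which is when SQLite added window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, month INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ana", 1, 100), ("ana", 2, 150), ("ana", 3, 150),
     ("ben", 1, 80), ("ben", 2, 60)],
)

query = """
SELECT rep, month, revenue,
       RANK()       OVER (PARTITION BY rep ORDER BY revenue DESC) AS revenue_rank,
       LAG(revenue) OVER (PARTITION BY rep ORDER BY month)        AS prev_revenue,
       SUM(revenue) OVER (PARTITION BY rep ORDER BY month)        AS running_total
FROM sales
ORDER BY rep, month
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
# ('ana', 1, 100, 3, None, 100)
# ('ana', 2, 150, 1, 100, 250)
# ('ana', 3, 150, 1, 150, 400)
# ('ben', 1, 80, 1, None, 80)
# ('ben', 2, 60, 2, 80, 140)
```

Unlike `GROUP BY`, a window function keeps every input row and adds a value computed over that row's partition. Notice how `RANK()` gives both of ana's 150 months a rank of 1 and then skips to 3.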

Python and Programming


Why Python?
I started with Python, and I’ll probably stick with Python for the rest of
my life. It’s so far ahead in terms of open source contributions, and it’s
straightforward to learn. Feel free to go with R if you want, but I have no
opinions or advice to provide regarding R.

Week 22: Introduction to Python


Video: (0:00 to 1:03:10)

Week 23: Lists, Tuples, Functions, Conditional Statements, Comparisons

Video: (1:03:10 to 2:00:37)

Week 24: Dictionaries, Loops, Comments


Video: (2:00:37 to 3:04:17)

Week 25: Try/Except, Reading & Writing files, Classes and Objects
Video: (3:04:17 to 4:20:43)

Week 26: Recursion


Video explanation of recursion

Course on recursion
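
As a quick illustration of the idea, here are two small recursive functions. They are my own examples and are not taken from the video or the course above.

```python
def factorial(n):
    """n! computed recursively."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= 1:                       # base case: stops the recursion
        return 1
    return n * factorial(n - 1)      # recursive case: a smaller version of the same problem

def flatten(nested):
    """Flatten an arbitrarily nested list of lists."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))  # recurse into the sub-list
        else:
            flat.append(item)           # a plain value needs no recursion
    return flat

print(factorial(5))                    # 120
print(flatten([1, [2, [3, 4]], 5]))    # [1, 2, 3, 4, 5]
```

Every recursive function needs a base case that returns without calling itself, otherwise it keeps calling itself until Python raises a `RecursionError`.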

Week 27: Binary Trees


Course on Binary Trees

Week 28: APIs and Anaconda


APIs for beginners

Getting Anaconda setup on your computer

Pandas
Why Pandas?
Arguably the most important library to know in Python is Pandas, which
is specifically meant for data manipulation and analysis.

Week 29: Getting and Knowing your data


Follow along and learn — YouTube video

Practice problem set #1

Practice problem set #2

Week 30: Filtering and Sorting


Follow along and learn — YouTube video

Practice problem set #1

Practice problem set #2

Week 31: Grouping


Follow along and learn — YouTube video

Practice problem set #1

Practice problem set #2

Week 32: Apply


Follow along and learn — YouTube video

Practice problem set #1

Week 33: Merge


Follow along and learn — YouTube video

Practice problem set #1

Practice problem set #2

Visualizing Data
Why Data Visualizations?
The ability to visualize data and insights is so important because it’s the
easiest way to communicate intricate information and a lot of
information at once. As a data scientist, you’re always selling yourself
and your ideas, whether you’re pitching a new project or convincing others
why your model should be productionized. Data visualizations are a
great tool to help you with that.

There are dozens of data visualization libraries out there, but I’m going
to focus on two: Matplotlib and Plotly.

Week 34: Data Visualizations with Matplotlib


Introduction to Matplotlib

3-D Visualizations in Matplotlib

Types of Data Visualizations in Matplotlib

Cheatsheet

Week 35: Data Visualizations with Plotly


Types of Visualizations in Plotly (beginner)
Types of Visualizations in Plotly (beginner and advanced)

Data Exploration and Preparation


Why Data Exploration and Preparation?

“Garbage in, garbage out”

The models that you create can only be as good as the data that you feed
into them. To understand what state your data is in, i.e. whether it’s
“good” or not, you have to explore the data and prepare the data.
Therefore, for the next four weeks, I’m going to provide several amazing
resources for you to go through and get a better understanding of what
data exploration and preparation entails.

Week 36: Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA) can be difficult because there’s no one
set way of doing it — but that’s also what keeps it exciting. Generally, you
want to…

Derive descriptive statistics (e.g. central tendency)

Perform univariate analysis (distributions and spread)

Perform multivariate analysis (scatterplots, correlation matrix,
predictive power score, etc.)

Check for missing data and check for outliers

Check out a beginner’s guide to EDA here.

Week 37: Data Preparation: Feature Imputation and Normalization


What is Feature Imputation?

6 ways to impute missing data

Normalization vs Standardization

Example of implementing normalization vs standardization
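
To see the difference at a glance, here is a minimal sketch using only the standard library. I'm assuming "normalization" means min-max scaling to the range [0, 1] and "standardization" means z-scores; the function names and sample values are my own.

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical feature values

def min_max_normalize(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

normalized = min_max_normalize(values)
standardized = standardize(values)

print(normalized)                             # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])    # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

Min-max scaling is bounded but sensitive to outliers, since a single extreme value sets `lo` or `hi`. Standardization is unbounded but keeps the shape of the distribution around a zero mean.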

Week 38: Feature Engineering and Feature Selection


Feature Engineering Mini Course

Feature Encoding

An Introduction to Feature Selection

Week 39: Imbalanced Datasets


An Introduction to Imbalanced Classification Problems

The Right Way to Oversample in Predictive Modeling

Machine Learning
Why Machine Learning?
Everything that you’ve learned has led up to this point! Not only is
machine learning interesting and exciting, but it is also a skill that all data
scientists have. It’s true that modeling makes up a small portion of a data
scientist’s time, but it doesn’t take away from its importance.

Later in your career, you might notice that I left out some machine
learning algorithms, like K-Nearest Neighbors, Gradient Boosting, and
CatBoost. This is completely intentional: if you can understand the
following machine learning concepts, you’ll have the skills to learn any
other machine learning algorithms in the future.

Week 40: Introduction to Machine Learning


Supervised vs Unsupervised, Continuous vs Discrete

Bias-variance tradeoff

Week 41: Linear Regression


Linear Models: Linear Regression

Linear Models: Multiple Regression

Mathematics behind linear regression

Week 42: Logistic Regression


Introduction to Logistic Regression

Part 1: Coefficients

Part 2: Maximum likelihood

Part 3: R-squared and P-value

Week 43: Regularization


Ridge Regression (L2)

Lasso Regression (L1)

Elastic Net Regression

Week 44: Decision Trees


Decision Trees Introduction

Feature Selection and Missing Data

Implementing a Decision Tree in Python

Week 45: Naïve Bayes


A Mathematical Explanation of Naïve Bayes

Naïve Bayes (StatQuest)

Week 46: Support Vector Machines


Intuition of Support Vector Machines

Support Vector Machines in Python

A mathematical explanation of Support Vector Machines

Week 47: Clustering


K-means clustering

Hierarchical clustering

Week 48: Principal Component Analysis


Principal Component Analysis (PCA) step-by-step

Another detailed explanation by Luis Serrano (I highly suggest you
watch both)

Mathematical explanation of PCA

Week 49: Bootstrap Sampling, Bagging, and Boosting


Bootstrap Sampling

Ensemble learning, Bagging, Boosting
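
Bootstrap sampling is easier to understand once you see it in code. Below is a minimal sketch that resamples a small dataset with replacement and uses the resampled means to get a rough 95% percentile interval. The data, the seed, and the helper `bootstrap_means` are assumptions for illustration only.

```python
import random
from statistics import mean

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Means of n_resamples bootstrap samples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [
        mean(rng.choices(data, k=len(data)))  # same size as the original sample
        for _ in range(n_resamples)
    ]

data = [2.1, 2.5, 2.8, 3.0, 3.2, 3.7, 4.1, 4.4, 5.0, 5.6]  # hypothetical sample
means = sorted(bootstrap_means(data))

# Roughly the 2.5th and 97.5th percentiles of the 1000 bootstrap means.
low, high = means[24], means[974]
print(f"sample mean = {mean(data):.2f}, approx. 95% interval = ({low:.2f}, {high:.2f})")
```

Bagging uses the same resampling trick: each model in the ensemble is trained on its own bootstrap sample, and their predictions are averaged or voted on.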

Week 50: Random Forests and Boosted Trees


Random Forests pt.1

Random Forests pt.2

XGBoost — Regression

XGBoost — Classification

XGBoost — Mathematical Details

XGBoost in Python

Week 51: Model Evaluation Metrics


Evaluation Metrics with Python Code

Understanding the confusion matrix and how to implement it in Python
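
As a small taste of what these metrics look like in code, here is a sketch that builds a binary confusion matrix by hand and derives accuracy, precision, recall, and F1 from it. The labels and predictions are made up, and in practice you would likely use a library rather than this hand-rolled helper.

```python
def confusion_matrix(y_true, y_pred):
    """Counts for a binary problem where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model output

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)  # 3, 1, 1, 3

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.75
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```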

Week 52: Data Science Project


If you feel comfortable with the materials above, you’re definitely ready
to start your own data science project! Just in case, I’ve provided three
ideas that you can use as inspiration to get started, but feel free to do
whatever you like.

Idea 1: SQL Case Study


Link to the case.

The objective of this case is to determine the cause for a drop in user
engagement for a social network called Yammer. Before diving into the
data, you should read the overview of what Yammer does here. There are
4 tables that you should work with.

The link to the case above will provide you with much more detail
pertaining to the problem, the data, and the questions that should be
answered.

Check out how I approached this case study here if you’d like guidance.

Skills You’ll Develop

SQL

Data Analysis

Data Visualization if you choose to visualize your insights.

Idea 2: Trustpilot Webscraper


Web scraping is simple to learn and extremely useful,
especially when it comes to collecting data for personal projects.
Scraping a customer review website, like Trustpilot, is valuable for a
company as it allows them to understand review trends (getting better or
worse) and see what customers are saying via NLP.

First I would get familiar with how Trustpilot is organized, and decide
upon which kinds of businesses to analyze. Then I would take a look at
this walkthrough of how to scrape Trustpilot reviews.

Skills You’ll Develop

Writing Python Scripts

Data Wrangling

BeautifulSoup/Selenium (webscraping libraries)

Data Analysis

Take it further and apply NLP to extract insights from reviews.

Idea 3: Titanic Machine Learning Competition


In my opinion, there’s no better way of showing that you’re ready for a
data science job than to showcase your code through competitions.
Kaggle hosts a variety of competitions that involve building a model to
optimize a certain metric, one of them being the Titanic Machine
Learning Competition.

If you want to get some inspiration and guidance, check out this step-by-
step walkthrough of one of the solutions.

Skills You’ll Develop

Data Exploration and Cleaning with Pandas

Feature Engineering

Machine Learning Modelling

Thanks for Reading!

I hope you found this useful! If you managed to get through this, you
should have a strong understanding of the fundamentals in Statistics,
Mathematics, SQL, Python/Pandas, and several machine learning
algorithms!

I hope this inspired you to continue learning too. There are so many
things that you can continue to explore, like more advanced models (e.g.
CatBoost), deep learning, experimental design, Bayesian modeling, and cloud
architecture, and the list goes on.

If you like this and want to see future content, be sure to give me a follow
on Medium. And as always, I wish you the best in your data science
endeavors.

Not sure what to read next? I’ve picked another article for you:

Want to Be a Data Scientist? Don’t Start With Machine Learning.


The biggest misconception aspiring data scientists have
towardsdatascience.com

and another!

12 Data Science Projects for 12 Days of Christmas


Relevant and valuable data science projects that you can do in a day!
towardsdatascience.com

Terence Shin
If you enjoyed this, follow me on Medium for more

Sign up for my email list here!

Let’s connect on LinkedIn

Interested in collaborating? Check out my website
