A Complete 52 Week Course To Become A Data Scientist in 2021
Introduction
If you want to be a data scientist but haven’t yet made the commitment,
now is the time.
And so, I’m presenting to you a complete 52-week curriculum that you
can do in 2021 as a New Year’s resolution! It is crammed and it will be a
grind, but it will be worth it.
Immediately, you’ll notice that this guide does not start with machine
learning, and I have good reasons for that. If you want to learn more
about why machine learning comes in the later weeks, check out my
article:
This will not cover everything that you need to know to be a fully
equipped data scientist. That being said, this will cover what I believe are
the most fundamental skills of a data scientist.
This assumes that you already know differential calculus, since we all
learned it in high school.
This curriculum will not include anything related to deep learning. Deep
learning deserves its own 52 weeks; it would be a disservice if I tried to
squeeze it in!
Course Structure
1. Statistics and Probability (Week 1 to Week 6)
Week 2: Probability
Theoretical probability
Sample spaces
Set operations
Addition rule
Permutations
Combinations
Combinatorics
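To make the Week 2 topics concrete, here is a minimal sketch using only Python's standard library. The two-dice sample space, the event names, and the runner/team scenarios are illustrative assumptions of mine, not part of the original course material.

```python
from fractions import Fraction
from itertools import product
from math import comb, perm

# Sample space for rolling two fair six-sided dice (36 equally likely outcomes).
sample_space = set(product(range(1, 7), repeat=2))


def probability(event):
    """Theoretical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event), len(sample_space))


# Two events expressed as subsets of the sample space.
sum_is_seven = {o for o in sample_space if sum(o) == 7}
first_is_six = {o for o in sample_space if o[0] == 6}

# Set operations: | is union ("A or B"), & is intersection ("A and B").
p_union = probability(sum_is_seven | first_is_six)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_addition_rule = (
    probability(sum_is_seven)
    + probability(first_is_six)
    - probability(sum_is_seven & first_is_six)
)

# Permutations count ordered selections; combinations count unordered ones.
ordered_podiums = perm(5, 3)  # gold/silver/bronze among 5 runners
unordered_teams = comb(5, 3)  # 3-person team chosen from 5 people
```

Counting the union directly and applying the addition rule give the same probability, which is a quick way to check your understanding of why the intersection is subtracted.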
Mathematics
Why Mathematics?
Like statistics, many data science concepts build on fundamental
mathematical concepts.
Transpose of a Matrix
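As a quick illustration of the transpose, here is a small sketch in plain Python. The helper name `transpose` and the example matrix are my own; in practice you would likely use a numerical library instead.

```python
def transpose(matrix):
    """Return the transpose of a matrix given as a list of equal-length rows."""
    # zip(*matrix) groups the i-th element of every row, i.e. the i-th column.
    return [list(column) for column in zip(*matrix)]


A = [
    [1, 2, 3],
    [4, 5, 6],
]
A_T = transpose(A)  # rows become columns: a 2x3 matrix becomes 3x2
```

Transposing twice returns the original matrix, which is a handy sanity check.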
At its core, SQL is used to extract (or query) specific data from a
database, so that you can do things like analyze the data, visualize the
data, model the data, etc. Therefore, developing strong SQL skills will
allow you to take your analyses, visualizations, and modeling to the next
level because you will be able to extract and manipulate the data in
advanced ways.
WHERE statement
ORDER BY statement
LIMIT statement
Comparison Operators
GROUP BY clause
HAVING clause
CASE WHEN
UNIONs
DATE_ADD()
DATE_SUB()
DATE_DIFF()
See here for more functions (on the left of the webpage)
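To see how several of these clauses fit together, here is a hedged sketch that runs one query against an in-memory SQLite database through Python's built-in `sqlite3` module. The `orders` table, its columns, and the thresholds are invented for illustration. The date functions listed above are left out because their names and arguments differ between SQL dialects, and SQLite does not provide `DATE_ADD()`.

```python
import sqlite3

# Build a tiny throwaway database in memory (hypothetical example data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "2021-01-05"),
        ("alice", 80.0, "2021-01-20"),
        ("bob", 15.0, "2021-01-07"),
        ("bob", 30.0, "2021-02-01"),
        ("carol", 300.0, "2021-02-10"),
    ],
)

query = """
SELECT
    customer,
    SUM(amount) AS total_spent,
    CASE WHEN SUM(amount) >= 250 THEN 'high' ELSE 'regular' END AS tier
FROM orders
WHERE order_date >= '2021-01-01'   -- comparison operator in a WHERE clause
GROUP BY customer                  -- one row per customer
HAVING SUM(amount) > 50            -- filter on the aggregated value
ORDER BY total_spent DESC          -- largest spenders first
LIMIT 2                            -- keep only the top two rows
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Note the difference between `WHERE` and `HAVING`: `WHERE` filters individual rows before grouping, while `HAVING` filters the groups after aggregation.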
Week 25: Try/Except, Reading & Writing files, Classes and Objects
Video: (3:04:17 to 4:20:43)
Course on recursion
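The Week 25 topics (try/except, reading and writing files, classes and objects) can be combined in one short sketch. The `ScoreBook` class and the file name are hypothetical examples of mine, not taken from the linked video.

```python
import os
import tempfile


class ScoreBook:
    """Hypothetical example class: stores scores and saves/loads them as text."""

    def __init__(self):
        self.scores = []

    def add(self, raw_value):
        # try/except: convert the input to a number and reject bad values.
        try:
            self.scores.append(float(raw_value))
            return True
        except ValueError:
            return False

    def save(self, path):
        # Writing a file: one score per line.
        with open(path, "w") as handle:
            for score in self.scores:
                handle.write(f"{score}\n")

    @classmethod
    def load(cls, path):
        # Reading a file: rebuild an object from the saved lines.
        book = cls()
        with open(path) as handle:
            for line in handle:
                book.add(line.strip())
        return book


book = ScoreBook()
accepted = [book.add(value) for value in ["90", "85.5", "not a number"]]

# Save to a temporary folder, then read the scores back into a new object.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scores.txt")
    book.save(path)
    restored = ScoreBook.load(path)
```

Using `with open(...)` closes the file automatically, even if an error occurs while reading or writing.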
Pandas
Why Pandas?
Arguably the most important library to know in Python is Pandas, which
is specifically meant for data manipulation and analysis.
Visualizing Data
Why Data Visualizations?
The ability to visualize data and insights is so important because it’s the
easiest way to communicate intricate information and a lot of
information at once. As a data scientist, you’re always selling yourself
and your ideas, whether you’re pitching a new project or convincing others
that your model should be put into production. Data visualizations are a
great tool to help you with that.
There are dozens of data visualization libraries out there, but I’m going
to focus on two: Matplotlib and Plotly.
Cheatsheet
The models that you create can only be as good as the data that you feed
into them. To understand what state your data is in, i.e. whether it’s
“good” or not, you have to explore the data and prepare the data.
Therefore, for the next four weeks, I’m going to provide several amazing
resources for you to go through and get a better understanding of what
data exploration and preparation entails.
Normalization vs Standardization
Feature Encoding
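The difference between normalization and standardization, and the idea behind feature encoding, can be sketched in a few lines of plain Python. The function names and sample values are my own assumptions, and the sketch assumes the input column is not constant (otherwise both rescalings would divide by zero).

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Standardization (z-scores): shift to mean 0, scale to unit std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """One-hot encoding: one 0/1 column per distinct category, sorted by name."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]


ages = [20, 30, 40, 50, 60]
normalized_ages = normalize(ages)      # bounded between 0 and 1
standardized_ages = standardize(ages)  # centered on 0, not bounded

# Columns are ordered alphabetically: blue, green, red.
encoded_colors = one_hot(["red", "green", "red", "blue"])
```

Normalization is bounded by the observed minimum and maximum, so a single extreme value can squash the rest of the column, whereas standardization has no fixed bounds.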
Machine Learning
Why Machine Learning?
Everything that you’ve learned has led up to this point! Not only is
machine learning interesting and exciting, but it is also a skill that all data
scientists have. It’s true that modeling makes up a small portion of a data
scientist’s time, but that doesn’t take away from its importance.
Later in your career, you might notice that I left out some machine
learning algorithms, like K-Nearest Neighbors, Gradient Boost, and
CatBoost. This is completely intentional: if you can understand the
following machine learning concepts, you’ll have the skills to learn any
other machine learning algorithm in the future.
Bias-variance tradeoff
Part 1: Coefficients
Hierarchical clustering
XGBoost — Regression
XGBoost — Classification
XGBoost in Python
The objective of this case is to determine the cause for a drop in user
engagement for a social network called Yammer. Before diving into the
data, you should read the overview of what Yammer does here. There are
4 tables that you should work with.
The link to the case above will provide you with much more detail
pertaining to the problem, the data, and the questions that should be
answered.
Check out how I approached this case study here if you’d like guidance.
SQL
Data Analysis
First, I would get familiar with how Trustpilot is organized and decide
which kinds of businesses to analyze. Then I would take a look at
this walkthrough of how to scrape Trustpilot reviews.
Data Wrangling
Data Analysis
If you want to get some inspiration and guidance, check out this step-by-
step walkthrough of one of the solutions.
Feature Engineering
I hope you found this useful! If you managed to get through this, you
should have a strong understanding of the fundamentals in Statistics,
Mathematics, SQL, Python/Pandas, and several machine learning
algorithms!
I hope this inspired you to continue learning too. There are so many
things that you can continue to explore, like more advanced models (e.g.,
CatBoost), deep learning, experimental design, Bayesian modeling, cloud
architecture, and the list goes on.
If you like this and want to see future content, be sure to give me a follow
on Medium. And as always, I wish you the best in your data science
endeavors.
Terence Shin