
Two Things…

Pre-semester Survey

● Take a minute to fill out the pre-semester survey now if you haven’t already!
○ We will assign you to a discussion section based on your response tonight.

Spotify Playlist

● We created this collaborative Spotify playlist to be played before class. Share your favorite songs with the class!

LECTURE 1

Course Overview
An overview of data science, Data 100/200, and the data science lifecycle.

Data 100/Data 200, Fall 2023 @ UC Berkeley


Narges Norouzi and Fernando Pérez

Roadmap
Lecture 01, Data 100 Fall 2023

• Intros
• What is data science?
• What will you learn in this class?
• Course overview
• Data Science Lifecycle
• Demo

Join at slido.com
#1100983


What emoji best describes your mood today?
Slido Q&A

Use the QR code on any slide to post questions to Slido.

This is the best place to clarify course content.

Intro - Narges Norouzi

● Ph.D. from University of Toronto (2017) in CE from Artificial Perception Lab.
○ Thesis focus: machine learning and statistical signal processing in biomedical applications.
● Research: ML with applications in biomedical and educational domains
○ Bioelectronic closed loop wound healing
○ Intelligent AI tutoring as classroom assistant
○ Find out more at: https://nargesnorouzi.me/
● Lecturer at University of Toronto (2016-2017).
● Teaching Professor at UC Santa Cruz (2017-2022).
● Teaching Professor at UC Berkeley (2022-now).
● Second time teaching this class!
○ Previously taught graduate and undergraduate Artificial Intelligence, Machine
Learning, Deep Learning, and Algorithms courses
● This semester, I’m:
○ Leading a small research group focused on AI education
○ Starting several grant-funded projects
Intros - Fernando Pérez

● Faculty in Statistics since 2017 (419 Evans Hall).
● @ Berkeley since 2008, formerly building tools for neuroscience.
● Background: Physics PhD, then Applied Mathematics (numerical methods).
● Big focus of my career: open source tools for science, especially in Python.
○ 2001: Created IPython as a graduate student.
○ 2012: Co-founded NumFOCUS, non-profit; open source for Science & Education.
○ 2014: Co-founded IPython's evolution, Project Jupyter.
○ 2020: Co-founded 2i2c.org, non-profit; interactive computing in S & E.
● Research interests
○ Interactive computational tools for scientific discovery and education.
■ Collaborative, open and reproducible research.
○ Data Science and Machine Learning in the physical sciences, esp. geoscience.
○ Applications in Cryosphere science (ice!), climate change & the environment.

What is Data Science?

Some examples of Data Science?

Recommendation Systems

WIFIRE (UCSD) - Wildfire modeling and management

https://wifire.ucsd.edu

2019: First Image of a Black Hole

Katie Bouman
MIT/Caltech Talk Video: https://youtu.be/TSgpIiktkwc

Why Data Science Matters

Data Science Enhances Critical Thinking

The world is complicated! Decisions are hard.
Data is used everywhere to answer hard questions and make tough decisions:
● Science
● Medicine
● Social science
● Engineering
● Sports
Claims about data come up in discussing almost any important issue:
● Instead of "Aquinas says," now it’s "the data says."
● It is usually not easy to tell what the data "says"
● Empower yourself to participate in the arguments that shape your life and your society

Data Is Changing the World

AI…?

From Joey Gonzalez.


The Darker Side of Data Science?

Obscuring complex decisions:
● Mortgage-backed securities → market crash
● Teaching scores & job advancement
Reinforcing historical trends and biases:
● Hiring based on previous hiring data
● Recidivism and racially biased sentencing
● Social media, news, and politics
We will discuss the ethics of data science throughout the class!

NPR author interview with Cathy O’Neil

But…I Am Optimistic!

Knowledge is empowering.

Data science offers immense potential to address challenging problems facing society.

The future is in your hands, and I believe:

You will use your knowledge for good.

…I am thrilled to teach Data 100 :-)

My Primary Goal for You in This Course

The world is complicated! Decisions are hard.

Data science is a fundamentally human-centered field that facilitates decision-making by quantitatively balancing tradeoffs.

● To quantify things reliably we must:
○ Find relevant data;
○ Recognize its limitations;
○ Ask the right questions;
○ Make reasonable assumptions;
○ Conduct an appropriate analysis; and
○ Synthesize and explain our insights.
● Apply critical thinking and skepticism at every step; and
● Consider how our decisions affect others.

After this course, you should be able to take data and produce useful insights on the world’s most challenging and ambiguous problems.

What Is Data 100?


PRINCIPLES AND TECHNIQUES OF DATA SCIENCE

Data Science Is a Fundamentally Interdisciplinary Field

Data Science is the application of data-centric, computational, and inferential thinking to:
● Understand the world (science).
● Solve problems (engineering).

Joey Gonzalez
(co-creator of this course)

Data Science Venn Diagram

by Drew Conway in 2010 (link)


Data Science in Industry

The major tasks that data scientists say they work on regularly.
Self-reported. Based on the results of the 2016 Data Science Salary Survey.

Data Science Requires Engineering and Scientific Insight

Good data analysis is not:
● Simple application of a statistics recipe.
● Simple application of statistical software.

There are many tools out there for data science, but they are merely tools.
● They don’t do any of the important thinking!

“The purpose of computing is insight, not numbers.”
R. Hamming. Numerical Methods for Scientists and Engineers (1962).

Example Questions in Data Science

Some (broad) questions we might try to answer with data science:
● What show should we recommend to our user to watch?
● In which markets should we focus our advertising campaign?
● Is the use of the COMPAS algorithm for prison sentencing fair?
● Should I send my kids to daycare?
● Is the world getting better or worse?
● What areas of the world are at higher risks for climate change impact in 10 years? 20?
● Where should we put docking ports for our bikes?
● What should we eat to avoid dying early of heart disease?
● Do immigrants from poor countries have a positive or negative impact on the economy?

What Will You Learn in This Class?

Course Goals

Prepare students for advanced Berkeley courses in data management, machine learning, and statistics, by providing the necessary foundation and context.

Enable students to start careers as data scientists by providing experience working with real-world data, tools, and techniques.

Empower students to apply computational and inferential thinking to address real-world problems.

Tentative List of Topics to be Covered in Data 100

● Pandas and NumPy
● Relational Databases & SQL
● Exploratory Data Analysis
● Regular Expressions
● Visualization
○ matplotlib
○ Seaborn
○ plotly
● Sampling
● Probability and random variables
● Model design and loss formulation
● Linear Regression
● Feature Engineering
● Regularization, Bias-Variance Tradeoff, Cross-Validation
● Gradient Descent
● Data science in the physical world
● Logistic Regression
● Clustering
● PCA

Which of these topics excites you the most? Please rank.

Prerequisites

Official prerequisites for this course:

● Completion of Data 8.
● Completion of CS 61A, Data C88C, or Engineering 7.
● Co-enrollment in EE 16A or Math 54 or Stat 89A.

The prereqs are being strictly enforced! We will not be teaching:


● How to use Python.
● How to use Jupyter notebooks.
● Inference from Data 8.
● Linear algebra (though we will review this topic to a greater degree since linear algebra is
a corequisite, not prerequisite).

Homework 1 and Lab 1 will help calibrate your background.


● For Homework 1, the Data 8 textbook will be helpful.

Course Overview

Staff

Leads


Head TAs

Rohan Jha, Shiny Weng, Lillian Weng, Milad Shafaie, Mihran Miroyan, Rahul Shah
Sean Wei, Tina Chen, Yash Dave, Yuerou Tang, Matthew Shen, Pragnay Nevatia

Leads run one aspect of the course (examples include: infrastructure, student support, logistics, etc.). Contact info: ds100.org/fa23/staff.

UCS2s

Shreya Gupta, Srikar Munukutla, Stephanie Xu, Angela Feng
Brandon (Dun-Ming) Huang, Celine Choi, Charlie Ji, Ishani Gupta

UCS2s teach discussions and assist in a wide variety of tasks (examples include: developing content, exam prep, etc.). Contact info: ds100.org/fa23/staff.

UCS1s

Aditi Somayajula, Aneesh Durai, Anya Agarwal, Arman Kazmi, Boyu Fan, Chetan Goenka
Geovanni Mojica, Harshit Jain, Ian Dong, Irene Nguyen, Jessica Hsu, Lisa Liu

UCS1s help review and grade our assignments, answer lecture and Ed questions, host office hours, and get trained for future semesters! Contact info: ds100.org/fa23/staff.

UCS1s

Madhav Varshney, Manas Khatore, Minoli Daigavane, Nikhil Reddy, Ollie Petrick, Prabhleen Kaur
Qijun Li, Sarika Pasumarthy, Victor Shi, Yiqing Zhang, Sakshi Kolli

UCS1s help review and grade our assignments, answer lecture and Ed questions, host office hours, and get trained for future semesters! Contact info: ds100.org/fa23/staff.

Course Logistics
Content and workflow

Syllabus Walkthrough


ds100.org/fa23/syllabus/

Course Websites / Platforms

Online Platforms

Course website (ds100.org/fa23)
● Where all lectures, assignments, and discussions are posted.
DataHub (data100.datahub.berkeley.edu)
● Where you will work on all assignments (links on the course website automatically take you here).
Ed (https://edstem.org/us/courses/42444)
● A place to ask and answer questions about assignments and concepts.
● Where all announcements are posted (exam logistics, new assignment released, etc).
Gradescope (gradescope.com, by invitation)
● Where all assignments are submitted, and where all of your grades in this course will live.
Lecture Notes (https://ds100.org/course-notes/)
● A summary of the highlights of topics covered in each lecture.
Textbook (www.textbook.ds100.org)
● Supplemental reading (not 100% synchronized with the class schedule).

Programming Environment for our Course: JupyterLab

Learning Advanced JupyterLab

JupyterLab offers notebooks and more tools for data science.

We’ll be accessing JupyterLab using DataHub (data100.datahub.berkeley.edu).

Resources for learning fancier JupyterLab functionality:
● The quickest intro is this great 2-minute overview by Serena Bonaretti.
○ Note: Unlike Serena’s example, in our course we’re using JupyterLab notebooks hosted on the
internet, not on your own local computer.
● The interface overview from the official docs has more details and short, embedded videos.
● A more detailed discussion from a bio/data angle: ~45 minute video.
● Full ~3h in-depth tutorial is available from the core team.

Data Science Lifecycle

The “data science lifecycle” you will see out in the wild may be slightly different than the one we teach you, but the core ideas are all the same.

The data science lifecycle is a high-level description of the data science workflow.

Note the two distinct entry points!

[Figure: the lifecycle as a cycle of four stages (Ask a Question, Obtain Data, Understand the Data, Understand the World) that feeds Reports, Decisions, and Solutions.]

1. Question/Problem Formulation

● What do we want to know?
● What problems are we trying to solve?
● What hypotheses do we want to test?
● What are our metrics for success?

2. Data Acquisition and Cleaning

● What data do we have and what data do we need?
● How will we sample more data?
● Is our data representative of the population we want to study?

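One way to start on the representativeness question is to compare category shares in a sample against known population shares. The sketch below is only an illustration: the class-year categories, the population shares, and the ten responses are all made-up assumptions, not course data.

```python
import pandas as pd

# Hypothetical population shares by class year (made-up numbers for illustration).
population_share = pd.Series(
    {"Sophomore": 0.30, "Junior": 0.45, "Senior": 0.25}
)

# Hypothetical sample: class years of ten survey respondents.
sample = pd.Series(
    ["Junior", "Junior", "Senior", "Junior", "Sophomore",
     "Junior", "Senior", "Junior", "Junior", "Senior"]
)

# Share of each class year in the sample, aligned to the population's categories.
sample_share = (
    sample.value_counts(normalize=True)
    .reindex(population_share.index, fill_value=0.0)
)

# Positive gap = over-represented in the sample; negative = under-represented.
comparison = pd.DataFrame({
    "population": population_share,
    "sample": sample_share,
    "gap": sample_share - population_share,
})
print(comparison)
```

A large gap does not prove the sample is unusable, but it flags groups whose answers may be over- or under-weighted in later analysis.
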
3. Exploratory Data Analysis & Visualization

● How is our data organized and what does it contain?
● Do we already have relevant data?
● What are the biases, anomalies, or other issues with the data?
● How do we transform the data to enable effective analysis?

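A minimal pandas sketch of these checks, assuming a small hypothetical table of survey responses. The column names, the values, and the "valid year is 1 to 6" rule are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical survey responses; the values are made up for illustration.
df = pd.DataFrame({
    "major": ["Data Science", "data science ", "Economics", None, "Statistics"],
    "year": [2, 3, 3, 4, 42],   # 42 is a deliberately implausible entry
})

# How is the data organized and what does it contain?
print(df.shape)      # (number of rows, number of columns)
print(df.dtypes)     # type of each column

# What are the issues with the data?
missing_per_column = df.isna().sum()                # missing values per column
implausible_years = df[~df["year"].between(1, 6)]   # assumed valid range: 1 to 6

# How do we transform the data to enable effective analysis?
# Normalize whitespace and capitalization so equal majors compare equal.
df["major_clean"] = df["major"].str.strip().str.title()
print(df["major_clean"].value_counts(dropna=False))
```

Whether an implausible row should be dropped, corrected, or kept is a judgment call that depends on the question being asked; the code only surfaces the candidates.
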
4. Prediction and Inference

● What does the data say about the world?
● Does it answer our questions or accurately solve the problem?
● How robust are our conclusions and can we trust the predictions?

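One familiar way to probe how robust a conclusion is, shown here as a sketch rather than the course's prescribed method, is the bootstrap: resample the data with replacement and see how much the estimate moves. The sleep-hours values, the seed, and the number of resamples below are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical sample: hours of sleep reported by 12 students (made-up values).
sample = np.array([6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 9.0, 7.0, 6.5, 8.5, 7.0, 6.0])

# Resample with replacement many times and record the mean of each resample.
n_resamples = 5_000
resampled_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_resamples)
])

# Middle 95% of the resampled means: an approximate 95% confidence interval.
lower, upper = np.percentile(resampled_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}")
print(f"approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```

A wide interval suggests the conclusion could change noticeably with a different sample, which is one reason to loop back and obtain more data.
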
Why do you think the data science lifecycle is iterative?

Demo: The Data Science Lifecycle
Available on the course website: https://ds100.org/fa23/lecture/lec01

Ask a Question: Who are you?

Data Acquisition and Cleaning

EDA and Visualization

Let’s understand what our data tells us, and let’s clean the data while we’re at it.

EDA and Visualization

Population: Data 100 students, Fall 2023

Some sub-questions:

1. How many students are in the class?
2. What are your majors?
3. What year are you?
4. How has enrollment in your major changed over time?

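The first three sub-questions map onto a few pandas calls. This is a hedged sketch on a tiny hypothetical roster: the `roster` table, its column names, and its values are made up and stand in for the real demo data on the course website.

```python
import pandas as pd

# Hypothetical roster standing in for the real survey data (values are made up).
roster = pd.DataFrame({
    "major": ["Data Science", "Computer Science", "Data Science",
              "Economics", "Data Science", "Computer Science"],
    "year": ["Junior", "Senior", "Junior", "Sophomore", "Senior", "Junior"],
})

# 1. How many students are in the class?
n_students = len(roster)

# 2. What are your majors? Count students per major, most common first.
major_counts = roster["major"].value_counts()

# 3. What year are you? Share of students in each class year.
year_share = roster["year"].value_counts(normalize=True)

print(n_students)
print(major_counts)
print(year_share)
```

Counting rows only answers "how many students" if each row is one student, which is why the next slide asks what each row and column represents.
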
Again, but for Campus Data

What does each row/column represent?

What’s the Point of This Demo?

We make assumptions in data science
● Is the data representative:
○ Of the question being asked
○ Of the world and its implications
● Beliefs/backgrounds of data collectors
● Beliefs/backgrounds of data analysts
● Beliefs/backgrounds of the population

Data Science does not and cannot live in a theoretical vacuum. Data Science is a human-centered technical practice.

See you soon!


LECTURE 1

Course Overview
Content credit: Acknowledgments
