
Introduction to Data Science

Data Ethics / Conclusion

2022 Spring
Jae-Gil Lee
Chapter 26.
Data Ethics
What Is Data Ethics?
● With the use of data comes the misuse of data.
● Recently, this idea has been reified as “data ethics” and has featured
somewhat prominently in the news.
○ Example: in the 2016 election, a company called Cambridge Analytica improperly accessed
Facebook data and used that for political ad targeting. (See the next slide.)
● Data ethics is a framework for thinking about right and wrong behavior
involving data.

Facebook–Cambridge Analytica Data Scandal
● It concerned the obtaining of the personal data of millions of Facebook users
without their consent by British consulting firm Cambridge Analytica,
predominantly to be used for political advertising.
● The data was collected through an app called “This Is Your Digital Life”.
● The app consisted of a series of questions to build psychological profiles on
users, and collected the personal data of the users’ Facebook friends via
Facebook’s Open Graph platform. The app harvested the data of up to 87
million Facebook profiles.
● Cambridge Analytica used the data to provide analytical assistance to the 2016
presidential campaigns of Ted Cruz and Donald Trump.


Mark Zuckerberg testifies before Congress on Capitol Hill in 2018 following the privacy scandal.
Source: https://www.theguardian.com/technology/2019/mar/17/the-cambridge-analytica-scandal-changed-the-world-but-it-didnt-change-facebook
Should I Care About Data Ethics?
● You should care about ethics whatever your job.
● Perhaps what’s different about technology jobs is that technology scales, and
that decisions made by individuals working on technology problems (whether
data-related or not) have potentially wide-reaching effects.
○ A tiny change to a news discovery algorithm could be the difference between millions of people
reading an article and no one reading it.
○ A single flawed algorithm for granting parole that’s used all over the country
systematically affects millions of people, whereas a flawed-in-its-own-way parole board
affects only the people who come before it.
● In general, you should care about what effect your work has on the world.

Building Bad Data Products
● Some “data ethics” issues are the result of building bad products.
● Microsoft released a chatbot named Tay that parroted back things tweeted to
it. The internet quickly discovered this and got Tay to tweet all sorts of
offensive, racist things.

● Google Photos at one point used an image recognition algorithm that would
sometimes classify pictures of black people as “gorillas.”
○ Here it seems likely the problem is some combination of bad training data, model inaccuracy,
and the gross offensiveness of the mistake (if the model had occasionally categorized
mailboxes as fire trucks, probably no one would have cared).

Building Bad Data Products

Collecting KakaoTalk messages without explicit approval for training a chatbot service

The terms of service of the ‘연애의 과학’ (Science of Love) app did state that chat
contents would be used for new services, but the controversy is that the data was used
to build a completely different chatbot rather than a Science of Love service. The
company has been criticized for this careless approach, since it did not clearly
disclose the scope of use of the collected personal data. Users are all the more
angry because the conversation-analysis feature is a paid service, yet their chat
contents were used without permission. (Source: Namuwiki)
ILUDA Chatbot Issue in Korea

Because of discriminatory remarks about LGBTQIA people, race, and gender, ILUDA was shut down three weeks after its launch.
Trading Off Accuracy and Fairness

● Is your model unfair?

Overall
  Prediction   People   Actions   %
  Unlikely     125      25        20%
  Likely       125      75        60%

Group A
  Prediction   People   Actions   %
  Unlikely     100      20        20%
  Likely       25       15        60%

Group B
  Prediction   People   Actions   %
  Unlikely     25       5         20%
  Likely       100      60        60%

Your model classifies 80% of group A as “unlikely” but 80% of group B as “likely.”
Fair or not fair? (Handwritten note: the model is biased.)
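The slide’s numbers can be reproduced with a short sketch (only the counts come from the slide; the dictionary layout and function names are my own). It shows the tension directly: the action rates within each prediction bucket are identical across groups, yet the selection rates are very different.

```python
# Counts from the slide: group -> {prediction: (people, actions_taken)}.
predictions = {
    "A": {"Unlikely": (100, 20), "Likely": (25, 15)},
    "B": {"Unlikely": (25, 5),   "Likely": (100, 60)},
}

def likely_rate(group):
    """Fraction of the group the model labels 'Likely' (selection rate)."""
    unlikely_people = predictions[group]["Unlikely"][0]
    likely_people = predictions[group]["Likely"][0]
    return likely_people / (unlikely_people + likely_people)

def action_rate(group, label):
    """Fraction of that prediction bucket who actually took the action."""
    people, actions = predictions[group][label]
    return actions / people

print(likely_rate("A"), likely_rate("B"))                      # 0.2 vs 0.8
print(action_rate("A", "Likely"), action_rate("B", "Likely"))  # 0.6 vs 0.6
```

So the model is equally well calibrated for both groups (60% of “likely” people act in each), but it labels group B “likely” four times as often: whether that counts as unfair depends on which fairness criterion you pick.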

Interpretability

(Handwritten note: if the reason behind a prediction matters, interpretability is
what counts; if the reason doesn’t matter, overall performance is what counts.)

● In some circumstances (possibly for legal reasons or if your predictions are
somehow life-changing) you might prefer a model that performs worse but
whose predictions can be explained. In others, you might just want the model
that predicts best. → explainable AI
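As a hypothetical illustration of what “explainable” means here, a linear model’s score decomposes into per-feature contributions (all feature names and weights below are invented for illustration, not trained on any data):

```python
# Toy linear model: score = bias + sum of weight * feature value.
weights = {"income": 0.8, "age": -0.3, "prior_defaults": 1.5}  # hypothetical
bias = -0.5

def predict_with_explanation(features):
    """Return the score and the per-feature contributions that produced it."""
    contributions = {name: weights[name] * value
                     for name, value in features.items()}
    score = bias + sum(contributions.values())
    return score, contributions

score, why = predict_with_explanation(
    {"income": 1.0, "age": 2.0, "prior_defaults": 0.0})
# score = -0.5 + 0.8 - 0.6 + 0.0 = -0.3, and `why` shows which feature drove it
```

A deep neural network might predict better, but it offers no comparably simple decomposition of why a particular person was scored the way they were.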

Recommendations
● YouTube makes money through advertising and (presumably) wants to
recommend videos that you are more likely to watch, so that they can show
you more advertisements. However, it turns out that people like to watch
videos about conspiracy theories, which tend to feature in the
recommendations.
● Does YouTube have an obligation not to recommend conspiracy videos, even
if that’s what lots of people seem to want to watch? → a dilemma
● ...

Biased Data
● More commonly, word vectors are based on some combination of Google
News articles, Wikipedia, books, and crawled web pages. This means that
they’ll learn whatever distributional patterns are present in those sources.
● For example, if the majority of news articles about software engineers are
about male software engineers, the learned vector for “software” might lie
closer to vectors for other “male” words than to the vectors for “female” words.
● ...
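A toy illustration of this effect (the 2-d vectors below are invented for the example, not learned from any corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = {                  # hypothetical 2-d embeddings
    "he":       [1.0, 0.1],
    "she":      [0.1, 1.0],
    "software": [0.9, 0.3],  # skewed toward "he" by a male-dominated corpus
}

sim_he = cosine(vectors["software"], vectors["he"])
sim_she = cosine(vectors["software"], vectors["she"])
# sim_he > sim_she: the embedding has absorbed the bias in its training text
```

Real embeddings trained on web-scale text show the same pattern in hundreds of dimensions: the geometry of the vectors reflects the distributional biases of the corpus.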

Data Protection
● On your social network, you know what technologies your users like, who their
data scientist friends are, where they work, how much they earn, how much time
they spend on the site, which job postings they click on, and so forth.
● The VP of Monetization wants to sell this data to advertisers, who are eager to
market their various “big data” solutions to your users. The Chief Scientist
wants to share this data with academic researchers, who are keen to publish
papers about who becomes a data scientist. The VP of Electioneering has
plans to provide this data to political campaigns, most of whom are eager to
recruit their own data science organizations.
● One thinks it’s wrong to hand the data over to advertisers; another worries that
academics can’t be trusted to safeguard the data responsibly.
● …
Conclusion

In This Semester ...
● Intuition of Data Science
● Data Preprocessing
● Statistical Experiments
● Machine Learning Basics & Gradient Descent
● Regression & Logistic Regression
● kNN, Naive Bayes, Decision Trees
● Neural Networks
● Deep Learning Basics
● Unsupervised Learning: Clustering (k-means, hierarchical algorithms)
● Application: Recommender Systems
● Big Data: MapReduce
● Data Ethics
In This Semester ...
● Python code (written from scratch, not relying on scikit-learn) + theoretical
preliminaries!

In This Semester …
● Kaggle competition for detecting fake import declarations, courtesy of the
Korea Customs Service

Congratulations!

● Grand prize (최우수상): Hyeontae Song, 0.92638
● Runner-up prize (우수상): Jaehui Hwang, 0.92626
● Runner-up prize (우수상): 조현준, 0.92626

The recipients will be recommended to the Korea Customs Service and will be
contacted for the award ceremony and presentation.

Thank You!
Any Questions?
