

TECHNOLOGY

Is Data Privacy Real? Don’t Bet on It


Aug 23, 2019  North America

In 2009, Netflix was sued for releasing movie ratings data from half a million subscribers who were identified only by unique ID numbers. The video streaming service divulged this “anonymized” information to the public as part of its Netflix Prize contest, in which participants were asked to use the data to develop a better content recommendation algorithm. But researchers from the University of Texas showed that as few as six movie ratings could be used to identify users. A closeted lesbian sued Netflix, saying her anonymity had been compromised. The lawsuit was settled in 2010.

The Netflix case reveals a problem about which the public is just starting to learn, but that data analysts and computer scientists have known about for years. In anonymized datasets where distinguishing characteristics of a person such as name and address have been deleted, even a handful of seemingly innocuous data points can lead to identification. When this data is used to serve ads or personalize product recommendations, re-identification can be largely harmless. The danger is that the data can be — and sometimes is — used to make assumptions about future behavior or inferences about one’s private life — leading to rejection for a loan, a job or worse.

A research paper published in Nature Communications last month showed how easy re-identification can be: a computer algorithm could identify 99.98% of Americans by knowing as few as 15 attributes per person, not including names or other unique data. Even earlier, a 2012 study showed that just by tracking people’s Facebook ‘Likes,’ researchers could identify whether someone was Caucasian or African-American with 95% certainty, male or female (93%), or gay (88%); whether they drank (70%); or whether they used drugs (65%).

This is not news to people in the industry — but it is to the public. “Most people don’t realize that even if personal information is stripped away or is not collected directly, it’s often possible to link certain information with a person’s identity by correlating the information with other datasets,” says Kevin Werbach, Wharton legal studies and business ethics professor and author of the book, The Blockchain and the New Architecture of Trust. “It’s a challenging issue because there are so many different kinds of uses data could be put to.” Werbach is a faculty affiliate of the Warren Center for Network and Data Sciences, a research center of Penn faculty who study innovation in interconnected social, economic and technological systems.

For example, telecom companies routinely sell phone location information to data aggregators, which in turn sell it to just about anyone, according to a January 2019 article in Vice. These data buyers could include landlords screening potential renters, debt collectors tracking deadbeats or a jealous boyfriend stalking a former flame. One data aggregator was able to find an individual’s full name and address as well as continuously track the phone’s location. This case, the article says, shows “just how exposed mobile networks and the data they generate are, leaving them open to surveillance by ordinary citizens, stalkers, and criminals.”

That’s because the data you generate — whether from online activities or information held
by your employer, doctor, bank and others — is usually stored, sold and shared. “That data
is often packaged and sold to third parties or ad exchange networks,” says Michael Kearns,
computer and information science professor at Penn Engineering. A founding director of the
Warren Center, he is also co-author of the book, The Ethical Algorithm. “You are leaving
data trails all over the place in your daily life, whether by where you move physically in the
world or what you do online. All this is being tracked and stored.”

“[Even] if personal information is stripped away or is not collected directly, it’s often possible to link certain information with a person’s identity by correlating the information with other datasets.” –Kevin Werbach

Companies and other entities do try to keep datasets anonymous — the common practice is to strip out unique information like the name and birthday. “It would seem like an effective approach and it used to work reasonably well,” says Kartik Hosanagar, Wharton professor of operations, information and decisions, and author of the book, A Human’s Guide to Machine Intelligence: How Algorithms Are Shaping Our Lives and How We Can Stay in Control. But increasingly people have begun to recognize that this approach fails to offer protection, especially if marketers cross-reference different datasets — say, social networking surveys with census data. “If one has enough information about individuals … and [applies] sophisticated machine learning algorithms, then it is possible to re-identify people,” he notes.

Mathematically speaking, it is not that hard to re-identify people by using non-private information, Kearns explains. Let’s say you drive a red car. A data analyst who knows that only 10% of the population has a red car can disregard 90% of the people. Further assume that among red-car drivers, half use Macs and the rest PCs. You use a Mac, so another 50% of the group can be removed, and so on until you’re identified. “Each attribute slices away a big chunk of the remaining candidates, so you can quickly get down to … a very small handful of people,” he says.
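Kearns’s arithmetic can be sketched in a few lines of code. The Python snippet below uses a made-up population of one million people with invented attribute frequencies (not any real dataset) to show how each known attribute shrinks the candidate pool:

```python
import random

random.seed(0)

# A made-up population of 1,000,000 people described only by a few
# seemingly innocuous attributes (none of them a name or an ID).
population = [
    {
        "car_color": random.choice(["red"] + ["other"] * 9),  # ~10% drive red cars
        "computer": random.choice(["mac", "pc"]),              # roughly a 50/50 split
        "zip_prefix": random.randrange(100),                   # 100 coarse regions
        "birth_month": random.randrange(1, 13),
    }
    for _ in range(1_000_000)
]

# Attributes known about the target person (all hypothetical values).
known = {"car_color": "red", "computer": "mac", "zip_prefix": 42, "birth_month": 7}

# Each attribute slices away a chunk of the remaining candidates.
candidates = population
for attr, value in known.items():
    candidates = [p for p in candidates if p[attr] == value]
    print(f"after matching {attr!r}: {len(candidates):,} candidates remain")
```

With just four mundane attributes, a million people collapse to a few dozen candidates; a handful more would typically single out one person.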

That is precisely what happened at Netflix. Researchers were able to identify subscribers by looking at what ratings they gave movies when they viewed the content — then cross-referencing this data with movie ratings on the IMDB website, where people use their own names. Netflix thought it did enough to keep identities private, but that wasn’t true. “Relatively small amounts of idiosyncratic information are enough to uniquely identify you,” says Aaron Roth, computer and information science professor at Penn Engineering and co-author with Kearns of The Ethical Algorithm. “That’s the fundamental problem with this approach of anonymizing datasets. Basically, it doesn’t work.”
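A highly simplified sketch of that kind of cross-referencing, using invented names, titles and ratings rather than the actual Netflix or IMDB data (the real study also matched on rating dates and tolerated small discrepancies), might look like this:

```python
# A toy linkage attack: match an "anonymized" ratings record against a public
# dataset in which people rate the same titles under their real names.
# All names, titles and ratings below are invented for illustration.

anonymized = {
    "user_81472": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
}

public_reviews = {
    "jane.doe": {"Movie A": 5, "Movie C": 4, "Movie D": 1, "Movie F": 3},
    "john.roe": {"Movie A": 3, "Movie B": 2, "Movie E": 5},
}

def overlap(a, b):
    """Count the titles rated identically in both records."""
    return sum(1 for title, rating in a.items() if b.get(title) == rating)

for anon_id, ratings in anonymized.items():
    best = max(public_reviews, key=lambda name: overlap(ratings, public_reviews[name]))
    print(anon_id, "most closely matches", best,
          "with", overlap(ratings, public_reviews[best]), "identical ratings")
```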

Millions of Data Points

People generate millions of points of information about themselves all the time. You create data every time you push the ‘Like’ button on Facebook, search for something on Google, make a purchase on Amazon, watch a show on Netflix, send a text on your mobile phone or make a transaction with your bank, insurer, retail store, credit card company or hotel. There’s also medical data, property records, financial and tax information, criminal histories, and the like. “You’re generating data all the time when you’re using the internet, and this can very quickly add up to a lot of features about you,” Roth says.

This kind of dataset looks like an Excel spreadsheet where rows and columns correspond to different points of information, Roth notes. For example, the rows in the Netflix dataset represented 500,000 subscribers, while the columns corresponded to the roughly 18,000 movies they rated. That might seem like a lot of information, but to data scientists, “that’s not considered a large dataset,” he says. Facebook and Google, for example, would have datasets with millions of people, each of them having millions of attributes.

Roth explains that when organizations compile this information, they’re typically not interested in individual people’s records. “Netflix wasn’t interested in what movies Suzy watched. They were interested in statistical properties of the dataset … to predict what movies someone would like.” The way to make predictions is to let a machine learning algorithm go through the data and learn from it. After this training, the algorithm can use what it learned to predict things about a person, such as what she would like to watch, eat, buy or play. This data is invaluable to marketers, who use the information to try to serve relevant digital ads to consumers.

Much of the data that is collected from people online is used for advertising, Kearns says. “A
great deal of the internet is monetized by advertising. Facebook and Google are entirely free
services that make money by advertising. All the data that they collect and ingest is largely in
service of improving their advertising predictions, because the better they can target ads to
you, the more money they make from their advertisers,” he adds. “This is the vast majority
of their revenue.”

“That’s the fundamental problem with this approach of anonymizing datasets. Basically, it doesn’t work.” –Aaron Roth

For a company like Amazon, advertising is not the prime money maker, although it is starting to focus on it, Kearns says. “They want all the data they can get about you to personalize product recommendations. So a lot of this data is being used for a sensible business use that is not necessarily at odds with consumers’ interests. We all prefer to have better
recommendations than worse recommendations. The problem arises when there’s mission
creep.” That’s when data meant to be used for one thing is also used for another. “The data
that’s useful for targeting advertising to you is also useful when I want to know whether to
give you a loan,” he says.

Another danger arises when this data is used to profile you. “It’s not just that I know 15 facts about you and that I uniquely identified you. It lets me predict a whole bunch of things about you other than those 15 features,” Kearns says. You might say, “‘I don’t really care if Facebook knows this about me, or Google knows that about me or Amazon knows this about me,’” he says. “Actually, those innocuous facts may serve to identify you uniquely and they might also let unwanted inferences be made about you.” An average of 170 Facebook ‘Likes’ per person was enough to predict with high accuracy one’s sexual orientation in the 2012 study.
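To illustrate the kind of inference that study relied on (not its actual method or data), here is a rough sketch that trains a classifier on a fully synthetic ‘Likes’ matrix in which a hidden trait happens to correlate with a handful of pages:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a Likes matrix: rows are people, columns are pages,
# 1 means the person liked the page. Both the matrix and the trait labels
# are fabricated purely for illustration.
n_people, n_pages = 5000, 200
likes = rng.integers(0, 2, size=(n_people, n_pages))

# Invent a hidden trait that happens to correlate with a handful of pages.
informative_pages = [3, 17, 42, 99]
trait = (likes[:, informative_pages].sum(axis=1)
         + rng.normal(0, 0.5, n_people) > 2).astype(int)

# Train on 4,000 people, then predict the trait for the remaining 1,000
# from nothing but their page likes.
model = LogisticRegression(max_iter=1000).fit(likes[:4000], trait[:4000])
print("held-out accuracy:", round(model.score(likes[4000:], trait[4000:]), 3))
```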

Like their human counterparts, algorithms are not immune to mistakes. But when machines make mistakes, they do so at scale. “In the modern, data-driven era, when statistical modeling and machine learning are used everywhere for everything, whether we realize it or not, mistakes are being made all the time,” Kearns says. It’s one thing when the mistake is serving up a useless ad; it’s another when it leads to “systemic harms,” he adds.

For example, if an algorithm stereotypes incarcerated convicts from certain ethnic groups as being more likely to recommit crimes, it could recommend to a parole board not to release a prisoner from one of those groups. “That mistake has much bigger life implications than my showing you the wrong ad on Facebook,” Kearns says. “Society is waking up to this. … Scientists are waking up to it also and working on designing better models and algorithms that won’t be perfect but will try to reduce the systemic harm.”

The challenge is to create equity mathematically. One approach is to create algorithms that put different weights on various factors. In the example of predicting which convicts are most likely to recommit crimes if given parole, one way to remove the unfairness aimed at one ethnic group is to add inequity to other groups. “Typically, if I want the false positive rates to be equal among populations, I would have to probably increase it on one of those [other ethnic groups],” Roth says. But that means the overall error rate will get worse, he adds. So it comes down to tradeoffs. “How much do I value equity compared to raw accuracy?”
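The tradeoff Roth describes can be seen in a small simulation. The sketch below is entirely synthetic (invented risk scores and outcomes for two hypothetical groups, not any real parole data); equalizing the groups’ false positive rates by loosening one group’s decision threshold raises the overall error rate, which is the tradeoff he points to:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_group(n, base_rate, noise):
    """Synthetic group: binary outcomes and noisy risk scores (all invented)."""
    y = rng.random(n) < base_rate                      # True = the outcome occurs
    scores = np.clip(y * 0.6 + rng.normal(0.3, noise, n), 0, 1)
    return y, scores

def false_positive_rate(y, scores, threshold):
    flagged = scores >= threshold
    return (flagged & ~y).sum() / max((~y).sum(), 1)

def error_rate(y, scores, threshold):
    return ((scores >= threshold) != y).mean()

# Two hypothetical populations; group B's scores are noisier.
y_a, s_a = make_group(10_000, base_rate=0.3, noise=0.10)
y_b, s_b = make_group(10_000, base_rate=0.3, noise=0.25)

# A single shared threshold: decent accuracy, but unequal false positive rates.
t = 0.5
print("shared threshold   FPR A:", round(false_positive_rate(y_a, s_a, t), 3),
      " FPR B:", round(false_positive_rate(y_b, s_b, t), 3),
      " overall error:", round((error_rate(y_a, s_a, t) + error_rate(y_b, s_b, t)) / 2, 3))

# Equalize false positive rates by loosening group A's threshold until its
# FPR rises to match group B's; group A's accuracy gets worse as a result.
target = false_positive_rate(y_b, s_b, t)
t_a = max(th for th in np.linspace(0, 1, 1001)
          if false_positive_rate(y_a, s_a, th) >= target)
print("equalized FPRs     FPR A:", round(false_positive_rate(y_a, s_a, t_a), 3),
      " FPR B:", round(false_positive_rate(y_b, s_b, t), 3),
      " overall error:", round((error_rate(y_a, s_a, t_a) + error_rate(y_b, s_b, t)) / 2, 3))
```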

“In the modern, data-driven era, when statistical modeling and machine learning are used everywhere for everything … mistakes are being made all the time.” –Michael Kearns

Differential Privacy

One promising technical solution is called “differential privacy,” according to Roth. This technique adds ‘noise’ to the dataset to make accurate re-identification a lot harder. Let’s say an employer wants to conduct a poll in Philadelphia to see what fraction of the population has ever used drugs. Using a typical anonymizing technique, people’s answers are recorded but their names, ages, addresses and other unique information are hidden. But if a company really wants to spot drug users, it could cross-reference this data with other datasets to find them.

With differential privacy, people would still be asked the same questions, but their responses would be “randomized,” Roth says. It’s like asking them to flip a coin before they answer. If the coin comes up heads, they have to tell the truth. If it’s tails, they answer randomly. The result of the coin flip is hidden from researchers. “If you have used drugs, half the time you tell the truth when the coin flip comes up heads, half the time [it’s a] random answer,” because a coin toss is a 50%-50% proposition. And among just the random answers (on a tails, participants flip the coin again and let it pick their answer), “half of that time [people are telling] the truth,” Roth says.

That means 75% of the time the answer is truthful and 25% of the time it’s a lie. But are you
among the 75% or 25%? The researcher doesn’t know because the coin toss was a secret.
“Now you have strong plausible deniability,” Roth says. “If someone gets a hold of my
spreadsheet and says you use drugs … you have pretty strong deniability because [a lie could]
have happened one-fourth of the time.” In algorithmic terms, researchers add “a little bit of
randomness but [can] still very accurately compute population level statistics” while
guaranteeing strong plausible deniability, he explains.

Going back to the drug survey example, “in the aggregate, I can still get a very accurate
answer about what fraction of people in Philadelphia have used drugs because I know these
numbers: three-fourths of the time people tell the truth and one-fourth of the time people
lie,” Roth continues. “So in aggregate, I can subtract the noise and get a very accurate
estimate of the average … without ever having collected information about individuals that
implicated them [since] everyone has strong plausible deniability.”
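The coin-flip scheme Roth describes, often called randomized response, is easy to simulate. The sketch below uses a made-up “true” drug-use rate of 20% to show that any individual answer stays deniable while the aggregate can still be recovered by subtracting the known noise:

```python
import random

random.seed(0)

def randomized_response(true_answer: bool) -> bool:
    """The coin-flip scheme: heads, tell the truth; tails, answer at random."""
    if random.random() < 0.5:          # first flip comes up heads
        return true_answer
    return random.random() < 0.5       # tails: a second flip picks the answer

# Made-up ground truth: 20% of 100,000 respondents have used drugs.
truth = [random.random() < 0.20 for _ in range(100_000)]
reported = [randomized_response(t) for t in truth]

# Any single reported answer matches the truth only 75% of the time, so it is
# deniable; but E[reported yes rate] = 0.25 + 0.5 * true rate, so the aggregate
# can be debiased by subtracting the known noise.
observed = sum(reported) / len(reported)
print("observed yes rate:", round(observed, 4))
print("debiased estimate:", round((observed - 0.25) / 0.5, 4), "(true rate: 0.20)")
```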

In recent years, companies such as Google, Apple, Microsoft and LinkedIn have started to use this technique, Roth says. While the math has been around since 2006, “it’s only in the last few years that it has made it in the real world” because it takes time to shift from theoretical to practical, he says. In 2014, Google began using differential privacy in the collection of usage statistics on its Chrome web browser. Two years later, Apple did the same with personal data collection on iPhones. In 2020, the U.S. government will deploy this method in the Census survey.

“You can have two solutions, neither of which is better than the other but one offers more privacy but at the cost of higher error. The other offers more accuracy, but at the cost of less privacy.” –Aaron Roth

But there are tradeoffs to this method as well. The main one is that the level of plausible deniability is a matter of degree: the stronger the plausible deniability for individuals, the less accurate the results will be for the researcher. Roth likens it to a “knob you can turn … to set this privacy parameter.” Society as a whole has to figure out where the right balance lies between privacy and research results. “It depends on what you value. You can have two solutions, neither of which is better than the other but one offers more privacy but at the cost of higher error. The other offers more accuracy, but at the cost of less privacy. You have to decide which … you value more for your particular use case.”
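One way to picture that knob is to make the probability of answering truthfully a parameter in the same coin-flip sketch. In the toy simulation below (again using invented numbers), turning the knob toward stronger deniability makes repeated surveys of the same population scatter more widely around the true rate:

```python
import random

random.seed(1)

def survey(true_rate, n, p_truth):
    """Randomized response where p_truth is the privacy 'knob':
    closer to 1 means less deniability, closer to 0.5 means stronger deniability."""
    truth = [random.random() < true_rate for _ in range(n)]
    reported = [t if random.random() < p_truth else (random.random() < 0.5) for t in truth]
    observed = sum(reported) / n
    # Debias: E[observed] = (1 - p_truth) * 0.5 + p_truth * true_rate
    return (observed - (1 - p_truth) * 0.5) / p_truth

for p_truth in (0.95, 0.75, 0.55):
    estimates = [survey(true_rate=0.20, n=2_000, p_truth=p_truth) for _ in range(200)]
    print(f"p_truth={p_truth}: estimates of the 20% rate range from "
          f"{min(estimates):.3f} to {max(estimates):.3f}")
```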

Legal Protection

Today, the U.S. has a hodgepodge of regulations on data privacy. “There is no comprehensive privacy law on the federal level,” Werbach says. California does have an extensive privacy regulation modeled after the European Union’s stringent General Data Protection Regulation (GDPR), but there isn’t a national law. Roth agrees. “It’s a patchwork of different laws in the U.S. There’s no overarching privacy regulation. There’s one regulation for health, there’s one for [student] records, there’s another for video rental records. [And] some areas are unregulated.”

And what privacy laws the U.S. has need to be strengthened. Take the Health Insurance Portability and Accountability Act (HIPAA), which is designed to keep health records private. It requires that 18 types of personally identifiable information (name, address, age, Social Security number, email address and others) be hidden before a dataset can be used. “Under the safe harbor
provision, you can do whatever you want as long as you can redact these unique identifiers.
You can release the records,” Roth says. But “as we know, even from collections of attributes
that don’t seem to be personally identifying, we can recover personal identity.”

Roth also cites the Family Educational Rights and Privacy Act (FERPA), which protects student records, and the Video Privacy Protection Act (VPPA), which keeps video rental records private. The VPPA dates back to the 1980s, when a journalist dug up the video rental records of Supreme Court nominee Robert Bork, he says. Afterwards, Congress passed the act, which assesses penalties of $2,500 for every user record revealed. (The plaintiff in the Netflix lawsuit alleged violations of the VPPA.)

“[More] and more, it’s seen that consent is not enough as a protection.” –Kevin Werbach

Various privacy bills have been introduced in Congress in the aftermath of Facebook’s Cambridge Analytica scandal. But Werbach points out that for protection to be robust, regulations must go beyond “limiting specific kinds of collections but thinking broadly about data protection and what sorts of rights people should have in data collection.”

So far, “the U.S. approach has been mostly focused on consent — the idea that companies
should be transparent about what they’re doing and get the OK from people. [But] more and
more, it’s seen that consent is not enough as a protection,” Werbach adds. This consent is
usually buried in a company’s ‘Terms of Service’ agreement. However, this is “not
something an ordinary person is comfortable reading,” he says.

Refusing to agree to a company’s ‘Terms of Service’ also is not realistic for most people,
especially if they can get free use of a product or service such as Google search or Facebook.
So what can consumers do to protect themselves? “The one obvious solution is one that
nobody realistically will adopt — be extremely limited in your online activity. Don’t use
services that are big and sprawling and are collecting all sorts of data from you and maybe
using that data internally for things you don’t know about or giving that data to third parties
you’re unaware of,” Kearns says.

Werbach adds that consumers should be “attentive to what options you do have to say no, to
what choices you may have about how your information is collected and shared.” He says
many companies will let you opt out. But “at the end of the day, this is not a problem that can
be solved by end users. They don’t have the power to do this. We need a combination of legal
and regulatory oversight — and companies recognizing that it’s ultimately in their interest
to act responsibly,” Werbach says.

Until that happens, be afraid — be very afraid — of where your data breadcrumbs could lead.

All materials copyright of the Wharton School (http://www.wharton.upenn.edu/) of the University of Pennsylvania (http://www.upenn.edu/).