Big Data and Data Ethics

Introductory Talk
DR. VISHAL GOYAL

PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE.
PUNJABI UNIVERSITY PATIALA
Major Breakthroughs in AI in the 21st Century
Enabled by Big Data and Machine Learning
∙ Language Translation
Google Translate: Any Language to Any Language
∙ Speech to Speech Dialog
Siri, Cortana, Alexa
∙ Autonomous Vehicles
CMU, Stanford, Google, Tesla
∙ Deep Question Answering
IBM’s Watson
∙ Robo Soccer
∙ World Champion Poker
CMU Libratus
No Limit Texas Hold’em Poker
Major Breakthroughs in AI of the 20th Century
Enabled by Brute-force, Heuristics, Human Coding
of Rules and Knowledge, and Simple Machine
Learning (Pattern Recognition)
∙ World Champion Chess Machine
IBM Deep Blue
∙ Mathematical Discovery
Proof Checkers
∙ Accident Avoiding Car
CMU: No Hands Across America
∙ Robotics
Manufacturing Automation
Disaster Rescue Robots
∙ Speech Recognition Systems
Dictation Machine
∙ Computer Vision and Image Processing
Medical Image Processing
∙ Expert Systems
Rule Based Systems
Knowledge Based Systems
Big Data is a phrase used to mean a massive volume of
both structured and unstructured data that is so large it is difficult to
process using traditional database and software techniques. In most
enterprise scenarios, the volume of data is too big or it moves too fast
or it exceeds current processing capacity.
Evolution of Data
and its Applications
Evolution of Big Data
• By 2000, transaction and data warehousing systems had matured
• Relational databases were able to handle most industry requirements
• Data warehouses were able to handle reporting requirements
• Much of the innovation came from Oracle, Microsoft and Teradata
Evolution of data – Faster Internet
• Faster internet accelerated e-commerce, which resulted in a tremendous increase in transaction data
volume
• Faster internet also led to the generation of a lot of unstructured data such as emails, spreadsheets and
documents
• This brought new challenges in handling data
• Newer technologies emerged to handle the newer types of data
• Search engines such as Autonomy, Fast and GSA (Google Search Appliance) handled the unstructured
data load
• Engineered machines and data warehouse appliances took data warehousing to the next level and
were able to handle multiple TB of data
Evolution of Data – Social Media
Between 2003-2006, various social media platforms were launched that
accelerated the generation of unstructured data
2003 2004 2005 2006

Evolution of data – Game changer
• In 2007, Apple launched the iPhone, which started a revolution in smartphones and changed
society as a whole over the next few years.
• Before the launch of the iPhone there were smartphones such as the BlackBerry, but their scope was
mostly limited to enterprise use.
• The launch of the iPhone effectively put a very powerful computer in everybody’s hand
Evolution of Big Data – Smart Phones
Source: https://www.businessinsider.in/There-Was-Something-Different-About-The-Vatican-Crowd-In-2005/articleshow/21257038.cms
Evolution of Big Data – Smart Phones
Source: https://siliconangle.com/blog/2016/02/04/data-rich-more-people-have-access-to-the-internet-than-water/
Evolution of Data – Smart Phone Data Usage
Source: https://www.statista.com/statistics/752731/worldwide-average-monthly-smartphone-cellular-data-usage/
Data Size Projections – How Big is a Zettabyte
Byte (8 bits)
Kilobyte (1000 bytes)
Megabyte (1 000 000 bytes)
Gigabyte (1 000 000 000 bytes)
Terabyte (1 000 000 000 000 bytes)
Petabyte (1 000 000 000 000 000 bytes)
Exabyte (1 000 000 000 000 000 000 bytes)
Zettabyte (1 000 000 000 000 000 000 000 bytes)
Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
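The decimal unit ladder above can be sketched in a few lines of Python. This is only an illustration: the function name `humanize` and the unit abbreviations are assumptions made for this example, not part of the talk.

```python
# Decimal (SI) byte units, each 1000 times larger than the previous one.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"

if __name__ == "__main__":
    print(humanize(10**21))      # one zettabyte
    print(10**21 // 10**12)      # number of terabytes in one zettabyte
```

Seen this way, one zettabyte is a billion terabytes, which is why conventional storage systems stop being a realistic option at this scale.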
Data Size Projection
Evolution of Big Data
• All these factors together generated enormous volumes of data that could not be stored,
managed and analyzed in conventional relational and data warehouse systems
• There was a need to store this data more cheaply and in a way that allows it to be
analyzed quickly
• Companies like Yahoo, Google and Microsoft put a lot of money into research and
opened it to the open source community
• This resulted in the birth of Hadoop, which provided a software framework to store data on
distributed commodity hardware and process it using the MapReduce model
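To make the map reduce idea concrete, here is a minimal single-machine sketch of a word count. It does not use Hadoop or any Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "big models"]))
```

Because each document is mapped independently and each key is reduced independently, the work can be spread across commodity machines without the steps needing to share state.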
Evolution of Big Data – Artificial Intelligence/Machine
Learning
• With the evolution of data, it was not possible to analyze all this data
manually
• Companies started to use Artificial Intelligence to make sense out of this
• With the evolution in hardware speed, companies are now using Machine
Learning/Deep learning to create predictive models
Big Data/AI Use cases:
Customer profiling and Recommendation
• The most common use case, which almost everybody has experienced, is the recommendation
engine on various shopping websites
• Used by most e-commerce and subscription websites such as Amazon, eBay and Netflix
• Also used by social media websites to serve content based on your profile
• Data from the user base is profiled on a regular basis, user actions are watched in real
time, and recommendations are made according to user interests
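A toy version of profile-based recommendation can be sketched as follows. This is a simplified "users who share items with you also have..." heuristic chosen for illustration, not the method used by any of the companies named above. The data layout and the function name `recommend` are assumptions.

```python
from collections import Counter

def recommend(history, user, top_n=3):
    """Suggest items owned by users whose history overlaps with `user`.

    `history` maps each user name to the set of items they interacted with.
    An unseen item scores higher when it comes from a more similar user.
    """
    seen = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(seen & items)      # similarity: number of shared items
        if overlap == 0:
            continue
        for item in items - seen:        # only recommend unseen items
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

if __name__ == "__main__":
    history = {
        "ana": {"laptop", "mouse"},
        "bo": {"laptop", "mouse", "keyboard"},
        "cy": {"laptop", "monitor"},
    }
    print(recommend(history, "ana"))
```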
Big Data/AI use cases: Connected Vehicles
• Newer vehicles are capable of sending data every second
• Every vehicle sends hundreds of MB of data daily
• The data includes dozens of data points, including:
location
monitoring data from various vehicle systems such as the
odometer, engine etc.
user actions such as speed of the car, position of windows, door locks,
headlight status, pressing of any button on the dashboard etc.
• There are devices available that can be connected to older vehicles to
get similar information
Big Data/AI Use Case: Connected Cars
• Auto companies plan to use this data for predictive maintenance
Imagine getting a call from your car dealer asking you to bring
your car to the dealership so that they can replace a part that might fail
in a few days
• Ride share/transportation companies are already using such data to keep
track of drivers’ driving behavior
• Insurance companies are trying to use this data to set more accurate
insurance premiums
Big Data/AI use cases: connected cars
real time trip analysis image
Source: http://www.qstarz.com/Products/GPS%20Products/CR-Q1100Vbc-F.html
Big Data/AI Applications – Smart Meters
•Allows consumers to track their usage almost real time

•Allows service providers to enforce regulations
•Helps in leak detections
Big Data application: Sports Analysis
• A number of major league teams are using sensors and AI to analyze
their team performance and find ways to improve it.
• IoT provides data that was not available before and also provides a
quantitative way to measure performance.
Big Data/AI use cases: Others
• Automatic Stock Market Transactions
○ Investment companies (hedge funds, mutual funds) have machine learning
algorithms that continuously watch various feeds and can make automatic
transactions.
○ This gives them a slight time advantage that can translate into millions of dollars
• Fraud Detection
○ Credit card companies use big data/AI to find anomalies in real-time transactions
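As a rough illustration of anomaly detection on transaction amounts, the sketch below flags values that lie far from the median. It uses the modified z-score based on the median absolute deviation, a generic statistical technique swapped in here for illustration. The talk does not say which method card companies use. The threshold of 3.5 and the function name `flag_anomalies` are assumptions.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds `threshold`."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(amount - med) for amount in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

if __name__ == "__main__":
    card_history = [20, 25, 22, 24, 21, 23, 950]
    print(flag_anomalies(card_history))
```

A production system would score each transaction as it arrives and use many more signals (location, merchant, time of day), but the principle of comparing new activity with a customer's normal pattern is the same.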
Big Data Challenges - Security
• Security and privacy are a myth
• Only 2 things can save you from privacy intrusion:
○ YOU!
○ Regulations
• Europe is very serious about bringing in stringent privacy regulations
• It recently introduced the GDPR, which gives the public the “right to be forgotten”
Big Data Challenges: Misuse of the data
• Companies know more about you than they should
• Consumers provide them with this data in exchange for using their products for free
• Companies analyze the data and reach out to potential targets using
recommended pages and products.
• This has the potential to change the fabric of society
Big Data/AI Challenges: Quick Reaction
In 2013, the Associated Press (AP) Twitter account was hacked and one fake tweet
was posted from the account. It briefly wiped out about $130 billion from the stock
market
Source: https://www.telegraph.co.uk/finance/markets/10013768/Bogus-AP-tweet-about-explosion-at-the-White-House-wipes-billions-off-US-markets.html
Big Data Challenges: ML/AI not always accurate
Cats Vs Dogs Detection Problem

Teacher’s evaluation example
Uber using Big Data Analysis
Sources of Big Data for AI 2.0
● Personal Data: Data from People
○ Health
○ Education
○ Food Security
○ Energy Security
○ Water Security
○ Other: Medicine/Pharmaceuticals, Shelter/Housing, Sanitation
● National/Local Data: Data from Places and Things
○ Transportation
○ Telecom/Smartphone/Communication/WiFi
○ Banking
○ Entertainment
○ Shopping
● Global Data – Data from Unplanned Events (Black Swans)
○ Earthquakes/Tsunami
○ Typhoons/Hurricanes/Cyclones
○ Fire
○ Flooding
What to Collect?
Data Relevance: Big Data Parameters
Volume – Size and Scale
■ Up to 40,000 sensors in the Airbus A380
■ 7 TB per day
Velocity – Data Rate and Streaming Data
■ Sensor data collected in msec
■ Type and No of Sensors
Variety – Cross Media Data
■ Sensors
■ Images and videos
■ Text data
■ Relational business data
Validity - Reliability
■ Poor data quality
■ Missing data
■ Data collected doesn't suit targeted use cases
Value – Usefulness and Importance
■ Data per se is not valuable
■ How to extract real value from data?
Necessary Conditions for Collection
and Use of Big Data
●Infrastructure
●Instrument Data Sources

○People
○Places and
○Things
●Computing Power: Processor, Memory and Bandwidth

○Multi Farm Cloud Computing
○Super Computers for Processing
○Zettabyte (10^21 Bytes) Storage Farms
○Million Gigabit bandwidth
●Machine Learning and Analytics

How Spotify Uses Artificial Intelligence and Big Data To Give Us A
Great Music Experience
Big Data from Health
•Monitor Life Support Parameters
•Using Improvised Smart Phone
•Using a Low Cost Version of Apple Watch
and Fitbit
•Body Media Implant or Capsule in the
Future?
•Analyze the Data for Tell Tale Signals
•Notify and/or Warn the User of Impending
Problems
•Guardian Angel Service Providers For Health
•Devices, Tools and Apps
Big Data from Education
•Keystroke Activity Monitoring Exposes Student
Learning Behavior
 Attendance
Paying Attention or Playing games?
 Learning Speed
Problem Solving Speed
•Valuable Tool for Student, Parents and Educators
 Timely Notifications and Warnings
•Guardian Angel Service Providers For Education
Devices, Tools and Apps
Big Data from Emergencies
Provide the Right Information to the Right People
 Right Information
 To the Right People
 At the Right Time
 In the Right Language
 In the Right Medium: Voice, Video and/or Text
 In the Right Level of Detail
Data to Knowledge: Machine Learning
Machine Learning is the Key to Unlocking Big Data
Machine Learning - Complexity and Automation Levels
• Data Analytics
Monitor : What happened?
Diagnosis : Why did it happen?
Prediction : What will happen?
Prescription : What is to be done?
Role of Machine Learning in Big Data
 Anomaly Detection
 Healthy Individuals vs. Persons with Potential Problems
 Classification
 Clustering into Groups of Similar Populations
 Failure Prediction
 Sensor-based Prediction of Future Health Problems
 Prescriptive Analytics and Optimization

 Recommend Actions and Optimize Activities for Preventive
Services like Flu Shots and Screening Tests

 Correlations
 Support for Root Cause Analysis like DNA Inheritance
 Forecasting
 Predict Life Expectancy for Insurance
Companies Using Behavioral analytics
● 1.8 billion customers

● analyze the behavior of customers in not only their own stores, but also
thousands of other retailers.
● Teamed up with Mu Sigma to collect and analyze data on shoppers’
behavior.
● With services like Hulu and Netflix competing for viewers’ attention, Time
Warner collects user data such as
○ how frequently customers tune in,
○ the effect of bandwidth on consumer behavior,
○ customer engagement and
○ peak usage times
● The company also segments its customers for advertisers by
correlating viewing habits with public data such as voter
registration information
Companies Using Behavioral analytics
● Customer complaints and PR crises have become more difficult to handle thanks to
social media.
● Nestle created a 24/7 monitoring centre to listen to all of the conversations about the
company and its products on social media.
● The company will actively engage with those that post about them online in order to
mitigate damage and build customer loyalty.
● McDonalds tracks vast amounts of data in order to improve operations and boost the
customer experience.
● The company looks at factors such as
○ the design of the drive-thru,
○ information provided on the menu,
○ wait times,
○ the size of orders and
○ ordering patterns
Problems with Big Data Machine Learning
●High dimensional data – Find the relevant features? Needs the
involvement of domain experts and/or automatic feature selection techniques
●Data Quality is poor: Data is not collected to be used for Machine Learning.
There are no standards with regards to sensor data. Difficult to integrate
different data sources.
●Rare event problem: Standard Machine Learning algorithms achieve poor
results. Needs special algorithms for unbalanced classes.
●No labels: Use unsupervised learning algorithms (e.g. anomaly detection)
●Use-case specific algorithms and data models: Needs flexibility and
extensibility
●Deployment challenge: Need for an automatic system for model deployment
and management
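One common way to approach the rare event problem listed above is to weight each class inversely to its frequency, so that the few failure examples count as much in training as the many normal ones. The sketch below computes such weights. The function name is an assumption, and the formula is the widely used "balanced" heuristic (the same one scikit-learn applies for `class_weight="balanced"`), shown here without any library dependency.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

if __name__ == "__main__":
    # Hypothetical sensor log: 98 normal readings and only 2 failures.
    labels = ["ok"] * 98 + ["failure"] * 2
    print(balanced_class_weights(labels))
```

With these weights, misclassifying one rare failure costs the learner roughly as much as misclassifying dozens of normal readings, which counteracts the tendency to predict the majority class every time.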
Machine Learning Steps in Big Data
●Sensor Data Acquisition
○ Pre-process and explore data and detect patterns and outliers.
○ 80% of data scientists’ time is spent on pre-processing
●Learning
○ Use domain user annotations as labels and sensor data as well as
business data to learn machine learning models
●Prediction
○ Use the learned model and apply to new data
●Feedback
○ Ask domain users to annotate patterns and anomalies
●Recommendation
○ Recommend steps that should be done by the domain user
● Action
○ Assess recommendations and act accordingly if appropriate
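The learning, prediction, feedback and recommendation steps can be sketched as a small loop. This is a deliberately tiny illustration that uses a one-dimensional nearest-centroid classifier. The sensor values, the label names and the function names (`learn`, `predict`, `recommend_action`) are all assumptions made for this example.

```python
from statistics import mean

def learn(readings, labels):
    """Learning: compute one centroid per label from annotated readings."""
    grouped = {}
    for value, label in zip(readings, labels):
        grouped.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in grouped.items()}

def predict(model, reading):
    """Prediction: assign the label whose centroid is closest to the reading."""
    return min(model, key=lambda label: abs(model[label] - reading))

def recommend_action(label):
    """Recommendation: map the predicted label to a suggested step."""
    return "schedule inspection" if label == "anomaly" else "no action"

if __name__ == "__main__":
    readings = [70, 72, 71, 95, 98]
    labels = ["normal", "normal", "normal", "anomaly", "anomaly"]
    model = learn(readings, labels)
    label = predict(model, 90)
    print(label, "->", recommend_action(label))
    # Feedback: a domain user annotates the new reading, then the model is re-learned.
    model = learn(readings + [90], labels + ["anomaly"])
```

The final action stays with the domain user, who assesses the recommendation before acting on it.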

Introduction: ETHICS vs MORALS
In most cases, when any one of us does something, we try
to act according to what society believes is right.
More often, we listen closely to what our own beliefs
about right or wrong are telling us, even if they’re
different from society’s views.
These two have to do with ethics vs morals. Even though these words are used
interchangeably by many, there is a big difference in their meanings.
ETHICS
● ETHICS refers to the rules that a social system provides us
with.
● These are the codes of how to act in a workplace, in a
public place, in a church, or anywhere else where other
people are present.
● It’s necessary to act according to ethical principles even if
they don’t agree with your own feelings, because if you
don’t, other people are likely to start judging you.
MORALS
● MORALS, in contrast, are our own principles. When we act
according to morals, we do something because we
personally are certain that this is the right thing to do.
● Ethics vs Morals Examples
○ The ethics of journalism are much debated.
○ She resigned over an issue of personal ethics.
○ The ethics of his profession don’t permit him to do that.
○ The play was considered an affront to public morals.
○ She had always been nice about her morals, she grew nicer still.
○ The novel reflects the morals and customs of the time.
ETHICS vs MORALS
BUSINESS ETHICS
ETHICS & VALUES:IMPORTANCE IN BUSINESS
▪ Every business works on certain principles and beliefs,
which are nothing but its values.
▪ Likewise, ethics is implemented in the organization to ensure the
protection of the interests of stakeholders like customers,
suppliers, employees, society and government.
▪ Caveat emptor: This ancient Latin proverb means “let the buyer
beware”. It is good advice whenever a person purchases
anything. Being aware of the fitness, quality, and
circumstances of a purchase is just good sense.
Case Study : Unethics in Business
●Ethical behaviour and corporate social responsibility can bring
significant benefits to a business. For example, they may:
○ Attract customers to the firm's products, thereby
boosting sales and profits.
○ Make employees want to stay with the business, reduce
labour turnover and therefore increase productivity.
○ Attract more employees wanting to work for the business,
reduce recruitment costs and enable the company to get
the most talented employees.
ETHICAL ISSUES
●Ethical issues arise when the following factors are
neglected or not given enough consideration:
○privacy,
○transparency,
○consent and
○power.
ETHICAL ISSUES
●Ethical Issue: A problem or situation that requires a person

or organization to choose between alternatives that must be
evaluated as ethical or unethical.
●This may come in an obvious form, like manipulating
numbers in a report or spending company money on
inappropriate activities.
●However, it can also occur more subtly, in the form of
bullying, accepting inappropriate gifts from suppliers, or
being asked to skip a standard procedure just once.
ETHICAL ISSUES IN CAPITALISM
●Capitalism is an economic system in which the means of
production and distribution are privately or corporately
owned. Operations are funded by profits, and not controlled
by a state government.
● Human Welfare Ignored: The economic decisions made by individual
entrepreneurs and producers under capitalism are based on their
self-interest and not on the good of society.
○ Social welfare is ignored altogether, as other rights take
precedence over human rights. Money, not man, rules the world
and debases humanity.
Social Responsibility of business
● Social responsibility means that individuals and companies have a duty to act in the
best interests of their environment and society.
● The concept of social responsibility protects the interests of other members of a society
such as workers, consumers and the community as a whole.
● The objective of managers for taking business decisions is not merely to maximize
profits but also to serve and protect the interests of other members.
● Social responsibility is related to the concept of ethics.
● Ethics is the discipline that deals with moral duties and
obligations.
● Social responsibility implies that corporate enterprises should
follow business ethics and work not only to maximise their
profits or shareholders’ value but also to promote the interests
of other stakeholders and society.
● An example of the lack of social responsibility of business witnessed in India:
the Bhopal Gas Leak Tragedy.
● There was a leakage of poisonous gas from a factory, which
resulted in the death of more than 2000 poor people, and about 2
lakh persons were badly injured.
Oil Spills
Example
● Netflix offers their employees 52 weeks of paid parental leave,
which applies to both parents.
- Within that time, employees have the option of going back to
work and then resuming their paid leave as it suits them.
● Dell: Dell contributes to environmental management by
shipping their laptops in less wasteful containers using more
eco-friendly materials.
● Ford Motor Company: Ford is another corporation attempting
to improve their environmental performance by offering 13
new electric vehicle models by 2020.
Example
These are the most common examples of corporate social
responsibility:
● Reduce carbon footprints to mitigate climate change
● Improve labor policies and embrace fair trade
● Engage in charitable giving and volunteer efforts within your
community
● Change corporate policies to benefit the environment
● Make socially and environmentally conscious investments
● Hold an annual tree-planting event.
● Set up recycling bins throughout your facilities
● Minimize your amount of paper waste
Contribution by Company for the society through its
customers
Example of social responsibility in business
Google notably earned the Reputation Institute’s highest
Corporate Social Responsibility (CSR) score by implementing the
following reductions in environmental impact:
● Data centers use 50% less energy than other
comparable data centers
● Google contributes to renewable energy, committing more than $1 billion to
renewable energy projects
● The company enables businesses to decrease their own
environmental impact by using Gmail
What are Ethical conflicts
● An ethical conflict occurs when you or a colleague makes a
decision that could be seen as illegal or inappropriate to a
third-party.
● Ethical conflicts result from the smallest lies to decisions that
can affect employees within the company, investors or
customers.
● There are many types of ethical conflicts in the workplace;
however, conflicts usually fall into the following categories:
○ Fraud
○ Confidentiality
○ Finance
● Fraud occurs when a company knowingly presents
information that is incorrect to employees or the public.
● A confidentiality ethical conflict occurs when information is
viewed or accessed by a party that should not be privy to that
information.
● Ethical conflicts that develop from dishonesty usually occur
because a company does not provide a complete picture of
information to customers or employees.
● There are three main ethical challenges related to data and
data science:
○ Unfair Discrimination
○ Reinforcement of Human Biases
○ Lack of Transparency.
● Unfair Discrimination: If data reflects unfair social biases against sensitive
attributes such as Race or Gender, then the inferences drawn from that
data might also be based on the said bias.
● Reinforcement of Human Biases: This kind of problem may arise when
various computer models are used in making predictions in areas such as
Insurance, Financial Loans, and Policing.
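One simple way to check a model's decisions for unfair discrimination is to compare approval rates across groups. The sketch below computes a disparate impact ratio, the lowest group approval rate divided by the highest. The group names, the data and the function names are hypothetical, and this is only one of several possible fairness measures.

```python
def selection_rates(decisions):
    """Compute the approval rate per group from (group, approved) pairs."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {group: approved[group] / totals[group] for group in totals}

def disparate_impact_ratio(decisions):
    """Ratio of the lowest to the highest approval rate (1.0 means parity)."""
    rates = selection_rates(decisions)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so no group is favoured
    return min(rates.values()) / highest

if __name__ == "__main__":
    # Hypothetical loan decisions: group A approved 8 of 10, group B 4 of 10.
    decisions = ([("A", True)] * 8 + [("A", False)] * 2
                 + [("B", True)] * 4 + [("B", False)] * 6)
    print(selection_rates(decisions))
    print(disparate_impact_ratio(decisions))
```

A ratio well below 1.0 does not prove discrimination on its own, but it signals that the data or the model deserves a closer look before being used for loans, insurance or policing.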
Stakeholder Interests
● Ethical conflicts arise when stakeholder interests differ.
For example, a company may choose to maximize production
and pay little attention to the quality of the end product while the
customers expect the product to meet their specifications.
Customers trust the organization to get it right the first time.
● An ethical conflict occurs when the interests of two employees
are at odds.
For example, two employees are up for one promotion and one
takes credit for the work of the other.
● The decision-maker needs to find a way to figure out who
deserves credit for the work and make the decision accordingly.
Nestlé: Case Study
● In the early 1970s, Nestlé claimed that its baby milk powder could
replace mother’s milk as baby food.
● The claim proved to be false, as the product
○ Lacks nutrients and antibodies found in mother’s milk
○ Creates addiction in babies
○ Is high in cost
● Nestlé faced consumer boycotts in the USA (1977), U.K. (1980) and
Canada (1984).
Whistle Blowing
● Whistle blowing is when a former or existing employee of an organization raises
their voice against unethical activities being carried out within the
organization.
● Whistle blowing is done to safeguard the interests of society and
the general public for whom the organization is functioning.
Whistle Blowing
Most often, the employees fear to raise a voice against the illegal activity being
carried out in the organization because of following reasons:
 Threat to life
 Lost jobs and careers
 Lost friendships
 Breach of trust and loyalty

Whistle Blowing
Whistleblower Protection Act
The Whistleblower Protection Act, which was enacted in 1989, and
strengthened in 2012, specifically protects people who work for the
federal government, and inform on illegal or improper activities
conducted by the government. The Whistleblower Protection Act
protects federal employees from potential retaliation from the
government. Examples of whistleblower retaliation may include:
● Termination of employment
● Demotion
● Suspension
● Threats or harassment
● Discrimination
Whistle Blowing Examples
● Corruption: This type of case includes reporting a broad
range of illegal conduct. Bribery is one of the best-known
examples, but corruption also covers fraud.
● Racial discrimination: If two employees in similar
situations are treated differently as a result of one
person’s race, color, descent, national or ethnic origin, or
immigrant status, this is legally considered racial
discrimination.
● Fraud: Fraud refers to any wrongful or criminal deception
undertaken to gain financial or personal benefits.
Whistle Blowing in Data Matters
● Over the past five years, a higher volume of data has been
released to the public than the previous 50 years put
together.
● This wave of leaks and breaches means that media
outlets, the public, and political systems need to decide
how best to serve the public interest when data is made
available online.
● A data breach is a compromise of security that leads to
the accidental or unlawful destruction, loss, alteration,
unauthorised disclosure of, or access to protected data.
● A data leak is a data breach where the source of the data is
from someone inside the organisation or institution that has
collected that data.
● This usually takes the form of whistleblowing (the act of
telling authorities or the public that someone else is doing
something immoral or illegal).
Biggest Breaches
Case Study of Coca Cola
MARKETING ETHICS
● Marketing is the science and art of exploring, creating, and
delivering value to satisfy the needs of a target market at a
profit.
● Marketing ethics is an area that deals with the moral
principles behind the operation and
regulation of marketing activities.
● Ethics in marketing applies to different spheres such as
product, pricing, promotion & advertising etc.
● It is a dimension of social responsibility that involves
principles and standards that define acceptable conduct in
marketing.
ETHICAL ISSUES IN MARKETING
UNETHICAL PRACTICES IN MARKETING
● Selling products that have passed their
expiry date, as many retailers do, is unethical.
● Promising shipment when knowing delivery is not
possible by the promised date is also unethical.
● Dispensing drugs without a prescription from a qualified
doctor, as many drug stores do, is also unethical.
● Moving products in unsafe vehicles is also
unethical.
UNETHICAL PRACTICES IN FINANCE
● Delays in paying wages, interest to financiers,
incentive, bonus to employees.
● Cheating employees of their dues towards medical
expenses, leave travel assistance, children
education fees etc.
● Creating bogus bills of purchase to show higher
costs and hence losses to avoid bonus payment to
employees.
Data Science
Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms and
systems to extract knowledge and insights from
structured and unstructured data. Data science is
related to data mining and big data.
Data Science and Machine Learning
● Data Science: an interdisciplinary field about
scientific methods… to extract
knowledge or insights from
data... uses techniques from many
fields …mathematics, statistics, information
science, and computer science, in
particular from the subdomains of
machine learning.
● Machine Learning: a subfield of AI that gives
computers the ability to
learn without being
explicitly programmed
● Artificial Intelligence Technology: “Intelligent Amplifiers”, the use of AI
to augment human intelligence
Ethics in data science
● Data has been integrated into every aspect of our life: the
friends and business connections we are asked to make, the
shopping circulars we receive in the mail, the news we see, and
the songs we’ve played.
● Data is collected from us at every turn: every trace of our
online presence, and sometimes even traces of our physical
presence.
● We’ve gained some advantages from data, but we’ve also seen
the damage that the misuse of data has caused.
Facebook Case Study
● What we’re still missing is an understanding of how to put
ethics into practice in data as well as in the overall product
development process.
● Ethics really isn’t about agreeing to a set of principles. It’s
about changing the way you act.
● It’s one thing to say that you should get permission from users
before using their data in an experiment.
● It’s quite another thing to get permission at web scale.
● Data Ethics is concerned with the following principles:

○ Ownership - Individuals own their own data.
○ Transaction Transparency - If an individual’s personal
data is used, they should have transparent access to the
algorithm design used to generate aggregate data sets.
○ Consent - If an individual or legal entity would like to use
personal data, one needs informed and explicitly
expressed consent of what personal data moves to whom,
when, and for what purpose from the owner of the data.
● Data Ethics is concerned with the following principles:

○ Privacy - If data transactions occur all reasonable effort
needs to be made to preserve privacy.
○ Currency - Individuals should be aware of financial
transactions resulting from the use of their personal data
and the scale of these transactions.
○ Openness - Aggregate data sets should be freely available
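The consent principle (what personal data moves to whom, when, and for what purpose) can be made concrete as a record that is checked before every use of the data. The sketch below is a minimal illustration. The class name `Consent`, its fields and the function `is_use_permitted` are assumptions made for this example, not a standard or a legal template.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Consent:
    owner: str            # the individual who owns the data
    recipient: str        # to whom the data may move
    purpose: str          # for what purpose it may be used
    fields: frozenset     # what personal data is covered
    valid_until: date     # until when the consent holds

def is_use_permitted(consent, recipient, purpose, fields, on):
    """Allow a use only if every aspect matches what the owner agreed to."""
    return (consent.recipient == recipient
            and consent.purpose == purpose
            and set(fields) <= consent.fields   # no field beyond those consented
            and on <= consent.valid_until)

if __name__ == "__main__":
    consent = Consent(
        owner="user-1",
        recipient="analytics-team",
        purpose="recommendations",
        fields=frozenset({"age", "city"}),
        valid_until=date(2030, 1, 1),
    )
    print(is_use_permitted(consent, "analytics-team", "recommendations",
                           {"age"}, date(2029, 6, 1)))
    print(is_use_permitted(consent, "ad-partner", "recommendations",
                           {"age"}, date(2029, 6, 1)))
```

Keeping the check explicit makes it harder for data collected for one purpose to drift silently into another.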
Why we need data ethics:-
● The future will be completely driven by Machine Learning,
and Data Science forms the epicenter of this future.
● Machines are fuelled by the data they are trained on.
● Every advertisement, every self-driving car, every medical
diagnosis provided by a machine will be based on certain
data.
● Data Ethics is a rapidly evolving field of study.
● A failure to handle data ethically can severely impact people
and could lead to a loss of trust in projects, products, or
organizations.
Concept of Informed Consent
● Informed consent is a process where researchers inform
participants about a study, and the participants willingly choose
whether they want to participate.
● Four main components:
○ Information: Participants must be given all relevant
information. This includes what it will take to be a participant,
the risks and benefits of participating, how the data will be
used and protected, etc.
○ Understanding: It’s not enough to just be given information;
participants have to adequately understand the information.
This means that a key part of informed consent is making sure
that the information is communicated well and people
understand it.
Concept of Informed Consent
● Four main components:

○ Volunteering: Participants have to genuinely volunteer to
participate, not be coerced, manipulated or persuaded in any
way.
○ Decision-making capacity: Informed consent requires that
participants can weigh the risks and benefits and come to their
own decision about whether they want to participate. It’s
important to remember that some groups, such as children
and people with mental disabilities, may not have this full
decision-making capacity.
What should be included in informed consent?
● The surveyor’s identity

● Surveyor’s contact information
● The survey’s purpose
● How participants are chosen
● Voluntary participation
● Survey process
● Risks and rewards
● Results
● Data privacy
● Data use
● Permission for recording
● Thank you
How can you get informed consent?
● The explanation can be given verbally or in writing. There are pros

and cons to each:
● With Written explanations, consistency is no problem. It’s easy to
ensure that every participant gets the same explanation, given in
the same way. However, people are more likely to skim (and thus
not fully understand) written explanations, rather than read them
in-depth.
● Verbal explanations can be more engaging than written
explanations. However, it’s important that each researcher reads
off the same script, and there should be some way to ensure that
researchers are following the script.
Data Ownership
● Data ownership is the act of having legal rights and complete

control over a single piece or set of data elements.
● It defines and provides information about the rightful owner of
data assets and the acquisition, use and distribution policy
implemented by the data owner.
● Data ownership is primarily a data governance process that
details an organization's legal ownership of enterprise-wide data.
● A specific organization or the data owner has the ability to create,
edit, modify, share and restrict access to the data.
● Data ownership also defines the data owner’s ability to assign,
share or surrender all of these privileges to a third party.
Data Privacy
● Data privacy or information privacy is a branch of data
security concerned with the proper handling of data:
consent, notice, and regulatory obligations. More
specifically, practical data privacy concerns often revolve
around:
○ Whether or how data is shared with third parties.
○ How data is legally collected or stored.
○ Regulatory restrictions such as GDPR.
Why is Data Privacy Important
● Data is one of the most important assets a company has.
With the rise of the data economy, companies find enormous
value in collecting, sharing and using data.
● Companies such as Google, Facebook, and Amazon have all
built empires atop the data economy.
● Transparency in how businesses request consent, abide by
their privacy policies, and manage the data that they’ve
collected is vital to building trust and accountability with
customers and partners who expect privacy.
Challenges of data privacy
● Liability of online service providers for third-party
content
● Conflicts with privacy laws
● Whistleblower schemes and data protection
law
● Information requests and data protection
law
Confidentiality and Anonymity
● Confidentiality and anonymity are ethical practices
designed to protect the privacy of human subjects while
collecting, analyzing, and reporting data.
● Confidentiality refers to separating or modifying any
personal, identifying information provided by
participants from the data.
● By contrast, anonymity refers to collecting data without
obtaining any personal, identifying information.
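The confidentiality practice described above, separating identifying information from the data, can be sketched in a few lines of Python. This is a minimal illustration, not a complete de-identification procedure: the function name `split_identifiers`, the field names, and the sample records are all hypothetical.

```python
import secrets

def split_identifiers(records, id_fields):
    """Separate identifying fields from study data, linked only by a random key."""
    identities = {}   # key -> identifying fields (to be stored separately, access restricted)
    study_data = []   # de-identified rows used for analysis
    for record in records:
        key = secrets.token_hex(8)  # random key, not derived from the person
        identities[key] = {f: record[f] for f in id_fields if f in record}
        row = {f: v for f, v in record.items() if f not in id_fields}
        row["participant_key"] = key
        study_data.append(row)
    return identities, study_data

# Hypothetical survey responses
records = [
    {"name": "A. Singh", "email": "a@example.org", "age": 34, "answer": "yes"},
    {"name": "B. Kaur", "email": "b@example.org", "age": 41, "answer": "no"},
]
identities, study_data = split_identifiers(records, id_fields={"name", "email"})
```

Under anonymity, by contrast, the `name` and `email` fields would never be collected at all, so there would be no `identities` table to protect.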
Confidentiality and Anonymity
● Typically, anonymity is the procedure followed in
quantitative studies, and confidentiality is maintained in
qualitative studies.
● In both cases, the researcher gathers information from
participants, and it is this information that becomes the
data to be analyzed.
● For the social scientist, people’s behaviors and
experiences are of great interest, rather than an exposé
about individuals.
Data Validity
● Data validation is a process that ensures the delivery of
clean and clear data to the programs, applications and
services using it.
● It checks for the integrity and validity of data that is
being inputted to different software and its components.
● Data validation ensures that the data complies with the
requirements and quality benchmarks.
Data Validity
● Data validation is intended to provide certain well-defined
guarantees for fitness, accuracy, and
consistency for any of various kinds of user input into an
application or automated system.
● Data validation rules can be defined and designed using
any of various methodologies, and be deployed in any of
various contexts
Different kinds of data validation
● Data-type check
● Simple range and constraint check
● Code and cross-reference check
● Structured check
● Consistency check
● Range check
● Criteria
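Several of the checks listed above can be illustrated with a small validator. The sketch below is only one possible shape for such rules: the order record, its field names, the limits, and the `VALID_COUNTRY_CODES` reference table are hypothetical and chosen for illustration.

```python
from datetime import date

VALID_COUNTRY_CODES = {"IN", "US", "GB"}  # hypothetical reference table

def validate_order(order):
    """Return a list of validation errors for one order record (empty if valid)."""
    errors = []
    # Data-type check, followed by a range check on the same field
    if not isinstance(order.get("quantity"), int):
        errors.append("quantity must be an integer")
    elif not 1 <= order["quantity"] <= 1000:
        errors.append("quantity must be between 1 and 1000")
    # Code and cross-reference check: value must exist in a reference table
    if order.get("country") not in VALID_COUNTRY_CODES:
        errors.append("unknown country code")
    # Structured check: postal code must be a 6-digit string
    postal = order.get("postal_code", "")
    if not (isinstance(postal, str) and len(postal) == 6 and postal.isdigit()):
        errors.append("postal_code must be 6 digits")
    # Consistency check: two fields must agree with each other
    ordered, shipped = order.get("ordered_on"), order.get("shipped_on")
    if ordered and shipped and shipped < ordered:
        errors.append("shipped_on is before ordered_on")
    return errors

good = {"quantity": 3, "country": "IN", "postal_code": "147002",
        "ordered_on": date(2024, 1, 5), "shipped_on": date(2024, 1, 7)}
bad = {"quantity": 0, "country": "XX", "postal_code": "14700",
       "ordered_on": date(2024, 1, 5), "shipped_on": date(2024, 1, 1)}
good_errors = validate_order(good)
bad_errors = validate_order(bad)
```

Returning a list of errors rather than stopping at the first failure lets the caller report every problem with a record at once.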
Algorithmic Fairness
● Algorithmic fairness is increasingly important because as
more decisions of greater importance are made by
computer programs, the potential for harm grows.
● Today, algorithms are already widely used to determine
credit scores, which can mean the difference between
owning a home and renting one.
● They are also used in predictive policing, which estimates
the likelihood that a crime will be committed, and in
scoring how likely a convicted person is to commit another
crime in the future, which influences the severity of
sentencing.
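One simple way to start examining a decision system for fairness is to compare outcome rates across groups. The sketch below computes a demographic parity gap, which is only one of several fairness measures and is not prescribed by the talk itself. The group labels and decisions are made-up illustrative data.

```python
def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs; returns approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

def demographic_parity_gap(decisions):
    """Difference between the highest and lowest group approval rates."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())

# Hypothetical credit decisions for two groups
decisions = [("A", True), ("A", True), ("A", False), ("A", True),
             ("B", True), ("B", False), ("B", False), ("B", False)]
rates = approval_rates(decisions)         # group A: 0.75, group B: 0.25
gap = demographic_parity_gap(decisions)   # 0.5
```

A large gap does not by itself prove that an algorithm is unfair, but it is a signal that the decisions deserve closer review.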
Doing Good Data Science
● We already have standards for data ethics.

● The ACM’s code of ethics, which dates back to 1993 and is
currently being updated, is clear, concise, and surprisingly
forward-thinking; 26 years later, it’s a great start for anyone
thinking about ethics.
● How do we put ethical principles into practice?
● Ethical principles are worse than useless if we don’t allow
them to change our practice, if they don’t have any effect on
what we do day-to-day.
● Any code of data ethics will tell you that you shouldn’t collect
data from experimental subjects without informed consent.
● But that code won’t tell you how to implement “informed
consent.”
● Informed consent is easy when you’re interviewing a few
dozen people in person for a psychology experiment.
● Informed consent means something different when someone
clicks an item in an online catalog.
● We need the ability to have conversations about what ethics
means, what it will cost, and what solutions to implement.
● As technologists, we frequently share best practices at
conferences, write blog posts, and develop open source
technologies — but we rarely discuss problems such as how to
obtain informed consent.
● Foremost, we need corporate cultures in which discussions
about fairness, about the proper use of data, and about the
harm that can be done by inappropriate use of data can be
considered. In turn, this means that we can’t rush products out
the door without thinking about how they’re used.
Oaths
● Data oaths have been proposed, similar to the ancient Hippocratic Oath for doctors.
● Much as we appreciate the work and the thought that goes into
oaths, we are skeptical about their value.
● Oaths have several problems:
● They’re one-shots. You take the oath once (if at all), and that’s it. There’s no
reason to keep it in the front of your consciousness.
● Oaths can actually give cover to people and organizations who are doing
unethical work. It’s easy to think “we can’t be unethical, because we
endorsed this oath.”
● Oaths do very little to connect theories and principles to practice. It is one
thing to say “researchers must obtain informed consent”; it’s an entirely
different thing to get informed consent at internet scale. Or to teach users
what “informed consent” means.
Checklists
● Checklists are built around simple, “have we done this?” questions,
and they are effective because they are simple.
● They don’t leave much room to wiggle. Either you’ve analyzed
how a project can be abused, or you haven’t.
● You’ve built a mechanism for gathering consent, or you haven’t.
● Granted, it’s still possible to take shortcuts: your analysis might be
inadequate and your consent mechanism might be flawed, but
you’ve at least gone on record as saying that you’ve done it.
Example
Here’s a checklist for people who are working on data projects:

❏ Have we tested our training data to ensure it is fair and
representative?
❏ Have we studied and understood possible sources of bias in
our data?
❏ What kind of user consent do we need to collect to use the
data?
❏ Do we have a mechanism for gathering consent from users?
❏ Have we explained clearly what users are consenting to?
The Five C’s
● We need guidelines to force discussions with the application
development teams, application users, and those who might
be harmed by the collection and use of data.
● Five framing guidelines help us think about building data
products.
● We call them the five Cs:
● consent
● clarity
● consistency
● control (and transparency)
● consequences (and harm)
Consent
● You can’t establish trust between the people who are
providing data and the people who are using it without
agreement about what data is being collected and how that
data will be used.
● Agreement starts with obtaining consent to collect and use
data.
● Unfortunately, the agreements between a service’s users
(people whose data is collected) and the service itself (which
uses the data in many ways) are binary (meaning that you
either accept or decline) and lack clarity.
Examples- Violations
● In Europe, Google collected data from cameras mounted on cars to
develop new mapping products.
● AT&T and Comcast both used cable set top boxes to collect data
about their users.
● Samsung collected voice recordings from TVs that respond to
voice commands.
There are many, many more examples of nonconsensual data
collection.
At every step of building a data product, it is essential to ask
whether appropriate and necessary consent has been provided.
Clarity
● Clarity is closely related to consent.

● You can’t really consent to anything unless you’re told clearly
what you’re consenting to.
● Users must have clarity about what data they are providing,
what is going to be done with the data, and any downstream
consequences of how their data is used.
● All too often, explanations of what data is collected or being
sold are buried in lengthy legal documents that are rarely read
carefully, if at all.
Example
● Users who played Cambridge Analytica’s “This Is Your Digital
Life” game may have understood that they were giving up their
data; after all, they were answering questions, and those
answers certainly went somewhere.
● But did they understand how that data might be used? Or that
they were giving access to their friends’ data behind the
scenes? That’s buried deep in privacy settings.
Consistency and trust
● Consistency, and therefore trust, can be broken either
explicitly or implicitly.
● An organization that exposes user data can do so intentionally
or unintentionally.
● In recent years, we’ve seen many security incidents in which
customer data was stolen:
● Yahoo!
● Target
● Anthem
● local hospitals
Example
● Facebook didn’t consistently enforce its agreement with its
customers.
● When the news broke, Facebook became unpredictable because most of
its users had no idea what it would or wouldn’t do.
● They didn’t understand their user agreements, they didn’t
understand their complex privacy settings, and they didn’t
understand how Facebook would interpret those settings.
Control and Transparency
● Once you have given your data to a service, you must be able
to understand what is happening to your data. Can you control
how the service uses your data?
● All too often, users have no effective control over how their
data is used.
● They are given all-or-nothing choices, or a convoluted set of
options that make controlling access overwhelming and
confusing.
● It’s often impossible to reduce the amount of data collected,
or to have data deleted later.
Example
● Facebook asks for (but doesn’t require) your political views,
religious views, and gender preference.
● What happens if you change your mind about the data you’ve
provided?
● If you decide you’d rather keep your political affiliation quiet,
do you know whether Facebook actually deletes that information?
● Do you know whether Facebook continues to use that information
in ad placement?
● Europe’s General Data Protection Regulation (GDPR) requires
users’ data to be provided to them at their request and
removed from the system if they so desire.
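The two user rights mentioned above, receiving a copy of your data and having it removed, can be sketched as operations on a data store. The `UserDataStore` class and its method names below are hypothetical, and this is an in-memory toy: a real system would also have to handle backups, logs, derived data, and copies already shared with third parties.

```python
import json

class UserDataStore:
    """Toy in-memory store that supports access and erasure requests."""

    def __init__(self):
        self._records = {}  # user_id -> {field_name: value}

    def save(self, user_id, field_name, value):
        self._records.setdefault(user_id, {})[field_name] = value

    def export(self, user_id):
        """Access request: return everything held about the user as JSON."""
        return json.dumps(self._records.get(user_id, {}), sort_keys=True)

    def erase(self, user_id, field_name=None):
        """Erasure request: delete one field, or the whole record if no field is given."""
        if field_name is None:
            self._records.pop(user_id, None)
        else:
            self._records.get(user_id, {}).pop(field_name, None)

store = UserDataStore()
store.save("u1", "political_views", "undisclosed")
store.save("u1", "city", "Patiala")
before = store.export("u1")
store.erase("u1", "political_views")   # user changes their mind about one field
after = store.export("u1")
```

Supporting per-field erasure, and not only whole-account deletion, gives users something better than an all-or-nothing choice.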
Consequences
● Data products are designed to add value for a particular user
or system.
● As these products increase in sophistication, and have broader
societal implications, it is essential to ask whether the data
that is being collected could cause harm to an individual or a
group.
● We continue to hear about unforeseen consequences and the
“unknown unknowns” about using data and combining data
sets.
Examples
● There are laws to protect specific sensitive data sets:

● For example, the Genetic Information Nondiscrimination Act
(GINA) was established in 2008 in response to rising fears that
genetic testing could be used against a person or their family.
● Unfortunately, policy doesn’t keep up with technology
advances; neither of these laws have been updated.
Implementing the Five Cs
● The five Cs are a mechanism to foster dialogue to ensure the
products “do no harm.”
● The five Cs need to be part of every organization’s culture.
● Product and design reviews should go over the five Cs
regularly.
● They should consider developing a checklist before releasing a
product to the public.
Data’s Day of Reckoning
● Although we’ve benefited from the use of data in countless
ways, it has also created a tension between individual privacy,
public good, and corporate profits.
● It is time for us to take responsibility for our creations.
● Responsibility is inevitably tangled with the complex
incentives that surround the creation of any product.
Ethics and Security Training
● Students may study ethical principles, but they don’t learn
how to implement those principles in their projects.
● As a result, they are ill-prepared for the challenges of the real
world.
● They’re not trained to think about ethical issues and how they
affect design choices.
● They don’t know how to have discussions about projects or
technologies that may cause real-world harm.
Example
● The White House report “Preparing for the Future of Artificial
Intelligence” highlights the need for training in both ethics and
security:
● Ethical training for AI practitioners and students is a necessary
part of the solution.
● Ideally, every student learning AI, computer science, or data
science would be exposed to curriculum and discussion on
related ethics and security topics.
Developing Guiding Principles
● The problem with ethical principles is that it’s easy to forget
about them when you’re rushing: when you’re trying to get a
project finished on a tight, perhaps unrealistic, schedule.
● When the clock is ticking away toward a deadline, it’s all too
easy to forget everything you learned in class — even if that
class connected ethics with solutions to real-world problems.
● Checklists are a proven way to solve this problem.
● You don’t go to the next stage until you’ve answered all the
questions affirmatively.
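The rule that a project does not move to the next stage until every question is answered affirmatively can be expressed as a simple gate. The sketch below is one hypothetical way to encode it: the `StageChecklist` class and the stage name are illustrative, and the questions are taken from the example checklist earlier in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class StageChecklist:
    """Checklist for one project stage; answers map question -> True, False, or None."""
    stage: str
    answers: dict = field(default_factory=dict)

    def open_items(self):
        # Anything not explicitly answered "yes" (False or unanswered) blocks the stage
        return [q for q, a in self.answers.items() if a is not True]

    def may_proceed(self):
        return not self.open_items()

checklist = StageChecklist(
    stage="model training",
    answers={
        "Have we tested our training data to ensure it is fair and representative?": True,
        "Do we have a mechanism for gathering consent from users?": False,
        "Have we explained clearly what users are consenting to?": None,
    },
)
blocked = not checklist.may_proceed()
remaining = checklist.open_items()
```

Treating an unanswered question the same as a "no" keeps the gate strict: silence does not count as approval.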
Building Ethics into a Data-Driven Culture
● Individual responsibility isn’t sufficient.

● Ethics needs to be part of an organization’s culture.
● We’ve seen many organizations recognize the value of
developing a data-driven culture; we need to ensure ethics and
security become part of that culture, too.
● An individual needs to be empowered to stop the process
before damage is done.
Example
● Toyota and W. Edwards Deming pioneered the use of the
andon cord to improve quality and efficiency.
● Anyone who saw a problem could pull the cord, which would
halt the production line.
● Senior managers as well as production line operators would
then discuss the issue, make improvements, and restart the
process.
● Any member of a data team should be able to pull a virtual
“andon cord,” stopping production, whenever they see an
issue.
Example
● Interviewers rarely ask questions about the candidate’s ethical
values.
● Rather than asking a question with a right/wrong answer,
we’ve found that it’s best to pose a problem that lets us see
how the candidate thinks about ethical and security choices.
● Here’s a question that is useful: Assume we have a large set of
demographic data. We’re trying to evaluate individuals and
we’re not supposed to use race as an input. However, you
discover a proxy for race with the other variables. What would
you do?
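As background to the question above, one way a proxy might be noticed is by checking how well another variable predicts the protected attribute on its own. The sketch below compares that against a majority-class guess. The function `proxy_strength`, the `zip_code` and `group` field names, and the rows are all hypothetical, and this is a rough screening heuristic, not a complete bias audit.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, protected):
    """Return (baseline, accuracy): how well `feature` alone predicts `protected`,
    compared with always guessing the most common protected value."""
    overall = Counter(r[protected] for r in rows)
    baseline = max(overall.values()) / len(rows)
    # For each feature value, guess the protected value seen most often with it
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[feature]][r[protected]] += 1
    correct = sum(max(c.values()) for c in by_value.values())
    return baseline, correct / len(rows)

# Hypothetical demographic rows: zip code partly reveals group membership
rows = (
    [{"zip_code": "Z1", "group": "X"}] * 3 + [{"zip_code": "Z1", "group": "Y"}]
    + [{"zip_code": "Z2", "group": "Y"}] * 3 + [{"zip_code": "Z2", "group": "X"}]
)
baseline, accuracy = proxy_strength(rows, "zip_code", "group")  # 0.5 and 0.75
```

When the accuracy is well above the baseline, the feature carries information about the protected attribute, and dropping the protected column alone does not remove that information from the model.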
Regulations
● In some industries, ethical standards have been imposed by
law and regulation.
● The European Union’s General Data Protection Regulation
(GDPR) takes an aggressive approach to regulating data use
and establishing a uniform data policy.
● Policy almost always lags behind technology: by the time a
policy has been formulated and approved, the technology has
moved on, and it’s impossible for policy makers to iterate
quickly enough to catch up.
Indian Government
WhatsApp Scandal
Building our Future
● We’re looking at a future in which most vehicles are autonomous.
● We will be talking to robots with voices and speech patterns
that are indistinguishable from humans.
● Devices will be listening to all our conversations, ready to make
helpful suggestions about everything from restaurants and
recipes to medical procedures.
● The shape of the future will depend a lot on what we do in the
next few years.
Building our Future
● We need to incorporate ethics into all aspects of technical
education and corporate culture.
● We need to give people the freedom to stop production if
necessary, and to escalate concerns if they’re not addressed.
● We need to incorporate diversity and ethics into hiring
decisions; and we may need to consider regulation to protect
the interests of individual users, and society as a whole.
● We can build a future we want to live in, or we can build a
nightmare. The choice is up to us.
