Big Data
Big Data is a collection of large and complex data sets that are difficult
to store and process using traditional database management and
processing applications. The challenge includes capturing, curating,
storing, searching, sharing, transferring, analyzing, and visualizing
this data.
THE 5 VS OF BIG DATA
Volume: The sheer scale of the data is what gives the term "big data" its weight.
Velocity: The speed at which data is generated and must be processed.
Variety: The different types of data: structured, unstructured, and semi-structured.
Veracity: The quality and consistency of data.
Value: The end goal: extracting useful insight from the data.
Big data is collected from a variety of sources:
Social Media/Online Web
Third-Party Cloud Storage
Transactional Data
Internet of Things (Machine Data)
INTERNET OF THINGS
The Internet of Things (IoT) is made up of heterogeneous sensors and devices that work together to
make human life smarter. These devices cooperate by sharing the information they collect
about their environment.
The Internet of Things is a system of interrelated computing devices, mechanical and digital
machines, objects, animals or people that are provided with unique identifiers and the ability to
transfer data over a network without requiring human-to-human or human-to-computer
interaction.
IoT Examples
Examples of objects that can fall into the scope of Internet of Things include
connected security systems, thermostats, cars, electronic appliances, lights in
household and commercial environments, alarm clocks, speaker systems, vending
machines, and more.
In a nutshell, IoT aims to connect all potential objects so they can interact with each other over the
internet, providing a more secure and comfortable life for humans. Recent research estimated that
there would be over 20 billion IoT devices by 2020.
TYPES OF DATA
Structured Data
Data that is the easiest to search and organize is known as structured data,
because it is usually contained in rows and columns and its elements can be
mapped into fixed, pre-defined fields.
This makes structured data easy to store, analyze, and search, and until recently it
was the only data easily usable by businesses. Today, most estimates put structured
data at less than 20 percent of all data.
Unstructured Data
A much bigger percentage of all the data in our world is unstructured data.
Unstructured data is data that cannot be contained in a row-column database and
doesn't have an associated data model.
This lack of structure makes unstructured data more difficult to search, manage,
and analyze, which is why companies widely discarded it until the recent
proliferation of artificial intelligence and machine learning algorithms made it
easier to process.
Semi-Structured Data
Beyond structured and unstructured data there is a third category, which is
essentially a mix of the two. Semi-structured data has some defining or
consistent characteristics but does not conform to a structure as rigid as
that expected of a relational database.
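The contrast between structured and semi-structured data can be sketched with Python's standard library. The records and field names below are invented for illustration: CSV rows share one fixed schema, while JSON records are self-describing and may each carry different fields.

```python
import csv
import io
import json

# Structured data: fixed rows and columns; every record has the same fields.
structured = io.StringIO("id,name,age\n1,Alice,34\n2,Bob,29\n")
rows = list(csv.DictReader(structured))

# Semi-structured data: self-describing (JSON), but fields vary per record.
semi_structured = json.loads("""
[
  {"id": 1, "name": "Alice", "age": 34},
  {"id": 2, "name": "Bob", "interests": ["cycling", "chess"]}
]
""")

print(rows[0]["name"])                # every CSV row has the same columns
print(semi_structured[1].get("age"))  # a field may be absent -> None
```

Code that consumes semi-structured data must handle missing fields explicitly (here via `dict.get`), which is exactly the extra effort the rigid schema of structured data avoids.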
THE BIG DATA PROCESS
The life cycle of big data can be divided into four phases: (1)
collection; (2) compilation and consolidation; (3) data mining and
analytics; and (4) use.
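The four phases above can be sketched as a toy pipeline. The function names, sources, and click counts are illustrative assumptions, not a real big data system:

```python
# Toy sketch of the four-phase big data life cycle described above.

def collect():
    # Phase 1: collection - gather raw records from several sources.
    return [{"src": "web", "clicks": 3}, {"src": "iot", "clicks": 7},
            {"src": "web", "clicks": 2}]

def consolidate(records):
    # Phase 2: compilation and consolidation - merge records by source.
    merged = {}
    for r in records:
        merged[r["src"]] = merged.get(r["src"], 0) + r["clicks"]
    return merged

def analyze(merged):
    # Phase 3: data mining and analytics - find the most active source.
    return max(merged, key=merged.get)

def use(top_source):
    # Phase 4: use - act on the insight (here, just report it).
    return f"target campaign at: {top_source}"

print(use(analyze(consolidate(collect()))))
```

In a real system each phase would be a distributed service rather than a function, but the hand-off of progressively refined data from one phase to the next is the same.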
Summary
Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time.
Examples of Big Data generation include stock exchanges, social media sites, jet engines, etc.
Big Data can be (1) structured, (2) unstructured, or (3) semi-structured.
Volume, Velocity, Variety, Veracity, and Value are a few characteristics of Big Data.
Improved customer service, better operational efficiency, and better decision making are a few
advantages of Big Data.
THE DARK SIDE OF BIG DATA
DATA DISCRIMINATION
BIG DATA: A TOOL FOR EXCLUSION OR INCLUSION?
Big data can produce tremendous benefits for society, such as advances in
medicine, education, health, and transportation, and in many instances, without
using consumers’ personally identifiable information.
Big data also can allow companies to improve their offerings, provide consumers
with personalized goods and services, and match consumers with products they
are likely to be interested in.
At the same time, advocates, academics, and others have raised concerns about
whether certain uses of big data analytics may harm consumers.
For example, if big data analytics incorrectly predicts that particular consumers
are not likely to respond to prime credit offers, certain types of educational
opportunities, or job openings requiring a college degree, companies may miss a
chance to reach individuals who desire this information.
EXPOSE SENSITIVE INFORMATION
Big data analytics may give companies new ways to attempt to justify their exclusion of certain
populations from particular opportunities. For example, one big data analytics study showed that
“people who fill out online job applications using browsers that did not come with the computer .
. . but had to be deliberately installed perform better and change jobs less often.”
If an employer were to use this correlation to refrain from hiring people who used a
particular browser, it could be excluding qualified applicants for reasons unrelated to
the job at issue.
FTC, 2016
UNDERREPRESENTATION AND OVERREPRESENTATION
Overfitting: models "may become too specialized or specific to the data
used for training" and, instead of finding the best possible decision rule overall,
they simply learn the rule best suited to the training data, thus perpetuating its
bias.
Another possible algorithmic cause of discriminatory outcomes is the use of proxies for
protected characteristics such as race and gender. A historically recognized
proxy for race, for example, is the ZIP or postal code, and "redlining" is defined as the
systematic disadvantaging of specific, often racially associated, neighborhoods or
communities.
DIGITAL DIVIDE
Creation of new digital divides: discrimination may arise due to
(1) differences in information access and processing skills (the Big
Data rich and the Big Data poor), and
(2) gender differences, insofar as most researchers with computational skills are men.
(Boyd and Crawford)
The commercialization of predictive models may also leave out vulnerable categories such
as people with disabilities or limited decision-making capacities, and high-risk patients.
(Cohen et al.)
GOOGLE FLU TRENDS- HOW FAULTY CAN ALGORITHMS BE?
A prime example that demonstrates the limitations of big data analytics is Google Flu
Trends, a machine learning algorithm for predicting the number of flu cases based on
Google search terms. To predict the spread of influenza across the United States, the
Google team analyzed the top fifty million search terms for indications that the flu
had broken out in particular locations. While the algorithm at first appeared to
produce accurate predictions of where the flu was most prevalent, it generated highly
inaccurate estimates over time.
This could be because the algorithm failed to take into account certain variables. For
example, the algorithm may not have taken into account that people would be more
likely to search for flu-related terms if the local news ran a story on a flu outbreak,
even if the outbreak occurred halfway around the world. As one researcher has noted,
Google Flu Trends demonstrates that a “theory-free analysis of mere correlations is
inevitably fragile."
CIVIL RIGHTS ISSUE OF THE 21ST CENTURY