UNIT 1
Black Box Data − It is a component of helicopters, airplanes, jets, etc. It captures the
voices of the flight crew, recordings from microphones and earphones, and performance
information about the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and
‘sell’ decisions that customers make on shares of different companies.
Power Grid Data − Power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.
Thus, Big Data includes data of huge volume, high velocity, and extensible variety. This data
will be of three types:
Structured data − Relational data.
Semi Structured data − XML data.
Unstructured data − Word, PDF, Text, Media Logs.
CONVERGENCE OF KEY TRENDS
1. Volume
2. Variety
3. Velocity
1. Volume
2. Variety
Data variety is the assortment of data. Traditionally data, especially operational data, is
“structured” as it is put into a database based on the type of data (i.e., character, numeric,
floating point, etc.).
Over the past couple of decades, data has increasingly become “unstructured” as the
sources of data have grown beyond operational applications. Oftentimes, text, audio, video,
image, geospatial, and Internet data (including click streams and log files) are considered
unstructured data.
However, since many of the sources of this data are programs, the data is in actuality
“semi-structured.” Semi-structured data is often a combination of different types of data
that has some pattern or structure that is not as strictly defined as structured data. For
example, call center logs may contain customer name + date of call + complaint, where the
complaint information is unstructured and not easily synthesized into a data store.
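The call-center case can be sketched in Python; the log line and its field layout are hypothetical:

```python
import re

# A call-center log line: name and date are structured, the complaint is free text.
log_line = "Asha Rao|2023-04-01|The app keeps crashing when I try to pay my bill"

name, call_date, complaint = log_line.split("|", 2)
record = {"customer": name, "date": call_date, "complaint": complaint}

# The first two fields load cleanly into a table; the complaint does not.
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", record["date"])
print(record["customer"], "->", record["complaint"][:20])
```

The name and date slot straight into columns, while the complaint text would need text mining before it carries any analytical value.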
3. Velocity
Data velocity is about the speed at which data is created, accumulated, ingested, and
processed.
The increasing pace of the world has put demands on businesses to process information
in real-time or with near real-time responses.
This may mean that data is processed on the fly, or while “streaming” by, to make quick
real-time decisions, or it may be that monthly batch processes are run intraday to produce
more timely decisions.
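Processing data "on the fly" can be illustrated with a minimal sliding-window sketch in Python (the sensor readings are invented):

```python
from collections import deque

def rolling_average(stream, window_size=3):
    """Yield the average of the last `window_size` readings as data streams by."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

sensor_stream = [10, 12, 14, 40, 12]  # e.g. readings arriving in real time
averages = list(rolling_average(sensor_stream))
print(averages)  # each output is produced as data arrives, not in a later batch
```

Each average is emitted the moment a reading arrives, which is the essence of stream processing as opposed to batch processing.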
■ Secondary research (i.e., competitive and marketplace data, industry reports, consumer data,
business data)
■ Supply chain data (i.e., EDI, vendor catalogs and pricing, quality information)
UNSTRUCTURED DATA
Unstructured data is basically information that either does not have a predefined data
model and/or does not fit well into a relational database.
Unstructured information is typically text-heavy, but may contain data such as dates,
numbers, and facts as well. The term semi-structured data is used to describe structured
data that doesn’t fit into a formal structure of data models.
However, semi-structured data does contain tags that separate semantic elements, which
includes the capability to enforce hierarchies within the data.
Characteristics of Unstructured Data:
Data neither conforms to a data model nor has any structure.
Data cannot be stored in the form of rows and columns as in databases.
Data does not follow any semantics or rules.
Data lacks any particular format or sequence.
Data has no easily identifiable structure.
Due to this lack of identifiable structure, it cannot easily be used by computer programs.
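Even without an identifiable structure, simple pattern matching can still recover isolated facts such as dates and numbers from text; a small Python sketch with a made-up memo:

```python
import re

memo = "Met the vendor on 2023-06-15; quoted price was 4500 USD for 120 units."

# Pull out dates first, then strip them so the remaining digits are plain numbers.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", memo)
without_dates = re.sub(r"\d{4}-\d{2}-\d{2}", "", memo)
numbers = [int(n) for n in re.findall(r"\d+", without_dates)]

print(dates)    # the date embedded in the free text
print(numbers)  # the price and quantity figures
```

This kind of extraction is fragile precisely because the data has no guaranteed format, which is why unstructured data is hard for programs to use.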
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data:
WEB ANALYTICS
Web analytics is the process of analyzing the behavior of visitors to a website. This
involves tracking, reviewing and reporting data to measure web activity, including the
use of a website and its components, such as webpages, images and videos.
Data collected through web analytics may include traffic sources, referring sites, page
views, paths taken and conversion rates. The compiled data often forms a part of
customer relationship management analytics (CRM analytics) to facilitate and streamline
better business decisions.
Web analytics enables a business to retain customers, attract more visitors and increase
the dollar volume each customer spends.
Determine the likelihood that a given customer will repurchase a product after purchasing it
in the past.
Monitor the amount of money individual customers or specific groups of customers spend.
Observe the geographic regions from which the most and the least customers visit the site
and purchase specific products.
Predict which products customers are most and least likely to buy in the future.
The objective of web analytics is to serve as a business metric for promoting specific
products to the customers who are most likely to buy them and to determine which
products a specific customer is most likely to purchase. This can help improve the ratio of
revenue to marketing costs.
In addition to these features, web analytics may track the clickthrough and drilldown
behavior of customers within a website, determine the sites from which customers most
often arrive, and communicate with browsers to track and analyze online behavior.
The results of web analytics are provided in the form of tables, charts and graphs.
Follow these steps as part of the web analytics process.
1. Setting goals. The first step in the web analytics process is for businesses to determine goals
and the end results they are trying to achieve. These goals can include increased sales,
customer satisfaction and brand awareness. Business goals can be both quantitative
and qualitative.
2. Collecting data. The second step in web analytics is the collection and storage of data.
Businesses can collect data directly from a website or web analytics tool, such as Google
Analytics. The data mainly comes from Hypertext Transfer Protocol requests -- including
data at the network and application levels -- and can be combined with external data to
interpret web usage. For example, a user's Internet Protocol address is typically associated
with many factors, including geographic location and clickthrough rates.
3. Processing data. The next stage of the web analytics funnel involves businesses processing
the collected data into actionable information.
4. Experimenting and testing. Businesses need to experiment with different strategies in order
to find the one that yields the best results. For example, A/B testing is a simple strategy to
help learn how an audience responds to different content. The process involves creating two
or more versions of content and then displaying it to different audience segments to reveal
which version of the content performs better.
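A/B testing ultimately comes down to comparing conversion rates across audience segments; a minimal sketch with hypothetical numbers:

```python
def conversion_rate(clicks, views):
    """Fraction of views that turned into clicks (0.0 when there were no views)."""
    return clicks / views if views else 0.0

# Hypothetical results from showing two content versions to different segments.
version_a = {"views": 1000, "clicks": 50}
version_b = {"views": 1000, "clicks": 65}

rate_a = conversion_rate(version_a["clicks"], version_a["views"])
rate_b = conversion_rate(version_b["clicks"], version_b["views"])
winner = "A" if rate_a >= rate_b else "B"
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  winner: {winner}")
```

In practice a real test would also check statistical significance before declaring a winner; this sketch only compares the raw rates.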
The two main categories of web analytics are off-site web analytics and on-site web analytics.
Off-site web analytics
The term off-site web analytics refers to the practice of monitoring visitor activity outside
of an organization's website to measure potential audience. Off-site web analytics
provides an industrywide analysis that gives insight into how a business is performing in
comparison to competitors.
It refers to the type of analytics that focuses on data collected from across the web, such
as social media, search engines and forums.
On-site web analytics
On-site web analytics refers to a narrower focus that uses analytics to track the activity of
visitors to a specific site to see how the site is performing.
The data gathered is usually more relevant to a site's owner and can include details on site
engagement, such as what content is most popular.
Two technological approaches to on-site web analytics include log file analysis and page
tagging.
Log file analysis, also known as log management, is the process of analyzing data
gathered from log files to monitor, troubleshoot and report on the performance of a
website.
Log files hold records of virtually every action taken on a network server, such as a web
server, email server, database server or file server.
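Log file analysis starts with parsing individual entries; a sketch that parses one line in the widely used Apache combined log format (the line itself is fabricated):

```python
import re

# One line in the Apache "combined" log format, a common web-server log layout.
line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'

# Capture: client IP, timestamp, HTTP method, path, status code, response size.
pattern = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d{3}) (\d+)')
ip, ts, method, path, status, size = pattern.match(line).groups()
print(ip, method, path, status)
```

Once each line is split into fields like this, aggregating them (top pages, error rates, traffic by region) becomes ordinary data processing.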
Page tagging is the process of adding snippets of code into a website's HyperText
Markup Language code using a tag management system to track website visitors and
their interactions across the website.
These snippets of code are called tags. When businesses add these tags to a website, they
can be used to track any number of metrics, such as the number of pages viewed, the
number of unique visitors and the number of specific products viewed.
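Aggregating the metrics that tags report back, such as page views and unique visitors, can be sketched as follows (the event data is invented):

```python
from collections import Counter

# Hypothetical events a page tag might send back: (visitor_id, page_viewed).
events = [
    ("v1", "/home"), ("v2", "/home"), ("v1", "/pricing"),
    ("v2", "/pricing"), ("v3", "/pricing"),
]

page_views = Counter(page for _, page in events)          # views per page
unique_visitors = len({visitor for visitor, _ in events})  # distinct visitors
print(page_views.most_common(1))
print(unique_visitors)
```

Real tag management systems stream these events to a collection server, but the aggregation logic is essentially this counting step at much larger scale.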
Web analytics tools report important statistics on a website, such as where visitors came
from, how long they stayed, how they found the site and their online activity while on the
site. In addition to web analytics, these tools are commonly used for product
analytics, social media analytics and marketing analytics.
INDUSTRY EXAMPLES OF BIG DATA & BIG DATA APPLICATIONS
1. Retail
Good customer service and building customer relationships is vital in the retail industry. The
best ways to build and maintain this service and relationship is through big data analysis.
Retail companies need to understand the best techniques to market their products to their
customers, the best process to manage transactions and the most efficient and strategic way to
bring back lapsed customers in such a competitive industry.
2. Manufacturing
Manufacturers can use big data to boost their productivity whilst also minimising wastage
and costs - processes which are welcomed in all sectors but vital within manufacturing.
There has been a large cultural shift by many manufacturers to embrace analytics in order to
make more speedy and agile business decisions.
3. Education
Schools and colleges which use big data analysis can make large positive differences to the
education system, its employees and students.
By analyzing big data, schools are supplied with the intel needed to implement a better
system for evaluating and supporting teachers, making sure students are progressing, and
identifying those at risk.
4. Transportation
Big Data powers the GPS smartphone applications most of us depend on to get from
place to place in the least amount of time. GPS data sources include satellite images and
government agencies.
Airplanes generate enormous volumes of data, on the order of 1,000 gigabytes for
transatlantic flights. Aviation analytics systems ingest all of this to analyze fuel
efficiency, passenger and cargo weights, and weather conditions, with a view toward
optimizing safety and energy consumption.
Big Data simplifies and streamlines transportation through:
Congestion management and traffic control − Thanks to Big Data analytics, Google Maps
can now tell you the least traffic-prone route to any destination.
Route planning − Different itineraries can be compared in terms of user needs, fuel
consumption, and other factors to plan for maximum efficiency.
Traffic safety − Real-time processing and predictive analytics are used to pinpoint
accident-prone areas.
5. Advertising and Marketing
Ads have always been targeted towards specific consumer segments. In the past,
marketers have employed TV and radio preferences, survey responses, and focus groups
to try to ascertain people’s likely responses to campaigns. At best, these methods
amounted to educated guesswork.
Today, advertisers buy or gather huge quantities of data to identify what consumers
actually click on, search for, and “like.” Marketing campaigns are also monitored for
effectiveness using click-through rates, views, and other precise metrics.
For example, Amazon accumulates massive data stores on the purchases, delivery methods, and
payment preferences of its millions of customers. The company then sells ad placements that can
be highly targeted to very specific segments and subgroups.
6. Government
Government agencies collect voluminous quantities of data, but many, especially at the
local level, don’t employ modern data mining and analytics techniques to extract real
value from it.
Examples of agencies that do include the IRS and the Social Security Administration,
which use data analysis to identify tax fraud and fraudulent disability claims. The FBI
and SEC apply Big Data strategies to monitor markets in their quest to detect criminal
business activities. For years now, the Federal Housing Authority has been using Big
Data analytics to forecast mortgage default and repayment rates.
The Centers for Disease Control tracks the spread of infectious illnesses using data from
social media, and the FDA deploys Big Data techniques across testing labs to investigate
patterns of foodborne illness. The U.S. Department of Agriculture supports agribusiness
and ranching by developing Big Data-driven technologies.
Military agencies, with expert assistance from a sizable ecosystem of defense contractors,
make sophisticated and extensive use of data-driven insights for domestic intelligence,
foreign surveillance, and cybersecurity.
8. Entertainment
The entertainment industry harnesses Big Data to glean insights from customer reviews,
predict audience interests and preferences, optimize programming schedules, and target
marketing campaigns.
Two conspicuous examples are Amazon Prime, which uses Big Data analytics to
recommend programming for individual users, and Spotify, which does the same to offer
personalized music suggestions.
9. Meteorology
Weather satellites and sensors all over the world collect large amounts of data for
tracking environmental conditions. Meteorologists use Big Data to:
Study natural disaster patterns
Prepare weather forecasts
Understand the impact of global warming
Predict the availability of drinking water in various world regions
Provide early warning of impending crises such as hurricanes and tsunamis
10. Healthcare
Big Data is slowly but surely making a major impact on the huge healthcare industry.
Wearable devices and sensors collect patient data which is then fed in real-time to
individuals’ electronic health records. Providers and practice organizations are now using
Big Data for a number of purposes, including these:
Prediction of epidemic outbreaks
Early symptom detection to avoid preventable diseases
Electronic health records
Real-time alerting
Enhancing patient engagement
Prediction and prevention of serious medical conditions
Strategic planning
Research acceleration
Telemedicine
Enhanced analysis of medical images
11. Education
Administrators, faculty, and stakeholders are embracing Big Data to help improve their
curricula, attract the best talent, and optimize the student experience.
Examples include:
Customizing curricula − Big Data enables academic programs to be tailored to the needs
of individual students, often drawing on a combination of online learning, traditional
on-site classes, and independent study.
Reducing dropout rates − Predictive analytics give educational institutions insights on
student results, responses to proposed programs of study, and input on how students fare
in the job market after graduation.
Improving student outcomes − Analyzing students’ personal “data trails” can provide a
better understanding of their learning styles and behaviors, and be used to create an
optimal learning environment.
Targeted international recruiting − Big Data analysis helps institutions more accurately
predict applicants’ likely success. Conversely, it aids international students in pinpointing
the schools best matched to their academic goals and most likely to admit them.
BIG DATA TECHNOLOGIES
Before we start with the list of big data technologies, let us first discuss this
technology's broad classification. Big Data technology is primarily classified into the
following two types:
Operational Big Data Technologies − This type mainly includes the basic day-to-day data
that people used to process. Typically, operational big data includes daily data such as
online transactions and social media platforms.
Analytical Big Data Technologies − This type covers the data from any particular
organization or firm that is usually needed for analysis using software based on big data
technologies. This data can also be referred to as raw data used as the input for several
Analytical Big Data Technologies.
Some specific examples that include the Operational Big Data Technologies can be listed as
below:
o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart, Walmart,
etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Some common examples that involve the Analytical Big Data Technologies can be listed as
below:
We can categorize the leading big data technologies into the following four sections:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:
1.Hadoop: When it comes to handling big data, Hadoop is one of the leading
technologies that come into play. This technology is based entirely on map-reduce
architecture and is mainly used to process batch information.
Also, it is capable enough to process tasks in batches. The Hadoop framework was
mainly introduced to store and process data in a distributed data processing environment,
in parallel, on commodity hardware, using a simple programming execution model.
Apart from this, Hadoop is also well suited for storing and analyzing data from
various machines at high speed and low cost. That is why Hadoop is known as one
of the core components of big data technologies. The Apache Software Foundation
released Hadoop 1.0 in Dec 2011. Hadoop is written in the Java programming language.
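The map-reduce idea behind Hadoop can be sketched in plain Python, without any of the distributed machinery:

```python
from collections import defaultdict

# A minimal in-memory sketch of map-reduce: map each record to (key, 1) pairs,
# group by key (the "shuffle"), then reduce by summing.
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, count in pairs:   # shuffle: group values by key
        totals[key] += count
    return dict(totals)

lines = ["big data tools", "big data platforms"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

In Hadoop proper, the map and reduce phases run as separate tasks across many nodes and the shuffle moves data between them; the logic per record is the same as this word count.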
2. MongoDB: MongoDB is not the same as traditional RDBMS databases that use structured
query languages; instead, it uses schema-free documents. The structure of data storage in
MongoDB is also different from traditional RDBMS databases.
This enables MongoDB to hold massive amounts of data. It is based on a simple
cross-platform document-oriented design. The database in MongoDB uses JSON-like
documents with dynamic schemas.
This ultimately supports operational data storage options, as seen in many financial
organizations. As a result, MongoDB is replacing traditional mainframes and offering
the flexibility to handle a wide range of high-volume data types in distributed
architectures.
MongoDB Inc. introduced MongoDB in Feb 2009. It is written with a combination of
C++, Python, JavaScript, and Go language.
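MongoDB-style document storage can be illustrated with plain Python dictionaries; no pymongo is used here, and the documents are invented:

```python
import json

# Two "documents" in the same collection need not share a schema.
customers = [
    {"_id": 1, "name": "Asha", "city": "Pune"},
    {"_id": 2, "name": "Ravi", "orders": [{"sku": "A42", "qty": 3}]},  # nested field
]

# Roughly what a query like find({"city": "Pune"}) does, done by hand:
matches = [doc for doc in customers if doc.get("city") == "Pune"]
print(json.dumps(matches[0]))
```

The second document carries a nested `orders` array the first lacks; a relational table would need a fixed column set or a join, whereas documents simply vary.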
3. RainStor: RainStor uses deduplication strategies that help manage the storage and
handling of vast amounts of data for reference. RainStor was designed in 2004 by the
RainStor software company. It operates just like SQL. Companies such as Barclays and
Credit Suisse use RainStor for their big data needs.
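A hash-based deduplication strategy of the kind described can be sketched as follows (the chunk contents are made up):

```python
import hashlib

def store_deduplicated(chunks):
    """Keep one copy per unique chunk; every occurrence keeps a reference to its hash."""
    store, refs = {}, []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # store the bytes only once
        refs.append(digest)              # duplicates become cheap references
    return store, refs

chunks = [b"trade-0001", b"trade-0002", b"trade-0001"]  # third is a duplicate
store, refs = store_deduplicated(chunks)
print(len(store), len(refs))  # unique chunks vs. total references
```

Only two unique chunks back three references, which is how deduplication shrinks the footprint of highly repetitive reference data.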
4.Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop
clusters using virtual indexes.
This helps us to use the Splunk Search Processing Language (SPL) to analyze data. Also, Hunk
allows us to report and visualize vast amounts of data from Hadoop and NoSQL data
sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
5.Cassandra: Cassandra is one of the leading big data technologies among the list of top
NoSQL databases. It is open source, distributed, and has extensive column storage
options. It is freely available and provides high availability with no single point of
failure. This ultimately helps in handling data efficiently on large commodity clusters. Cassandra's
essential features include fault-tolerant mechanisms, scalability, MapReduce support,
distributed nature, eventual consistency, query language property, tunable consistency,
and multi-datacenter replication, etc.
Cassandra was originally developed at Facebook in 2008 for its inbox search feature and
is now maintained by the Apache Software Foundation. It is based on the Java programming language.
Data Mining
Let us now discuss leading Big Data Technologies that come under Data Mining:
o ElasticSearch: In simple words, ElasticSearch is a search engine based on the Lucene
library and works similarly to Solr. Also, it provides a purely distributed,
multi-tenant-capable search engine.
This search engine is completely text-based and contains schema-free JSON
documents with an HTTP web interface.ElasticSearch is primarily written in a
Java programming language and was developed in 2010 by Shay Banon. Now, it
has been handled by Elastic NV since 2012.
ElasticSearch is used by many top companies, such as LinkedIn, Netflix,
Facebook, Google, Accenture, StackOverflow, etc.
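The core data structure behind Lucene-based engines like ElasticSearch is the inverted index; a minimal in-memory sketch:

```python
from collections import defaultdict

# An inverted index maps each term to the set of documents containing it.
docs = {
    1: "distributed search engine",
    2: "schema free json documents",
    3: "distributed json storage",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A multi-term query is then a set intersection over the posting lists.
result = index["distributed"] & index["json"]
print(sorted(result))
```

Real engines add tokenization, relevance scoring, and sharding on top, but lookup by term rather than by document is the idea that makes full-text search fast.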
Data Analytics
Now, let us discuss leading Big Data Technologies that come under Data Analytics:
o Apache Kafka: Apache Kafka is a popular streaming platform. This streaming platform
is primarily known for its three core capabilities: publishing and subscribing to streams of records, storing them, and processing them. It is
referred to as a distributed streaming platform. It is also defined as a direct messaging,
asynchronous messaging broker system that can ingest and perform data processing on
real-time streaming data. This platform is almost similar to an enterprise messaging
system or messaging queue. Besides, Kafka also provides a retention period, and data can
be transmitted through a producer-consumer mechanism. Kafka has received many
enhancements to date and includes some additional levels or properties.
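The producer-consumer mechanism with a retention period can be illustrated with a toy in-memory topic; this is a simulation of the idea, not the Kafka API:

```python
from collections import deque

class MiniTopic:
    """Toy stand-in for a Kafka topic: producers append messages, consumers read
    the log, and a bounded retention window discards the oldest messages."""
    def __init__(self, retention=5):
        self.log = deque(maxlen=retention)  # retention as a max message count

    def produce(self, message):
        self.log.append(message)

    def consume_all(self):
        return list(self.log)

topic = MiniTopic(retention=3)
for event in ["click:1", "click:2", "click:3", "click:4"]:
    topic.produce(event)
print(topic.consume_all())  # the oldest message fell out of the retention window
```

Kafka's real retention is time- or size-based and its log is partitioned and replicated across brokers, but the decoupling of producers from consumers via a retained log is the same.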
o Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. first released Splunk in 2004. It is written in a combination of AJAX,
Python, C++ and XML. Companies such as Trustwave, QRadar, and 1Labs make
good use of Splunk for their analytical and security needs.
o KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze
the obtained models, results, and interactive views. It also allows us to execute all the
analysis steps altogether. It consists of an extension mechanism that can add more
plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
o Spark: Apache Spark is one of the core technologies in the list of big data technologies.
It is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching
and windowing operations. Datasets and data frames are built on top of RDDs, which form
the integral components of Spark Core. Components like Spark MLlib, GraphX, and SparkR
help analyze and process machine learning and data science workloads. Spark is written
using Java, Scala, Python and R.
Spark was originally developed at UC Berkeley in 2009 and later donated to the Apache
Software Foundation. Companies like Amazon, ORACLE, CISCO, Verizon Wireless, and
Hortonworks use this big data technology and make good use of it.
o R Language: R is a programming language mainly used in statistical computing and
graphics. It is a free software environment used by leading data miners, practitioners
and statisticians. The language is primarily beneficial in the development of
statistical software and data analytics.
R version 1.0 was released in Feb 2000 by the R Foundation. It is implemented in C,
Fortran, and R itself. Companies like Barclays, American Express, and Bank of America
use the R language for their data analytics needs.
o Blockchain: Blockchain is a technology that can be used in several applications related
to different industries, such as finance, supply chain, manufacturing, etc. It is primarily
used in processing operations like payments and escrow. This helps in reducing the risks
of fraud. Besides, it enhances overall transaction processing speed, increases
financial privacy, and internationalizes markets. Additionally, it is also used to fulfill
the needs of shared ledgers, smart contracts, privacy, and consensus in any business
network environment.
Blockchain technology was first described in 1991 by two researchers, Stuart
Haber and W. Scott Stornetta. However, blockchain had its first real-world application
in Jan 2009 when Bitcoin was launched. It is a specific type of database, with
implementations written in languages such as Python, C++, and JavaScript. ORACLE,
Facebook, and MetLife are a few of the top companies using blockchain technology.
Data Visualization
Let us discuss leading Big Data Technologies that come under Data Visualization:
o Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps in analyzing data very quickly. Tableau
helps create visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software, which went public in May 2013.
It is written using multiple languages, such as Python, C, C++, and Java. Competing
business intelligence tools include Cognos, Qlik, and Oracle Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich
libraries and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js,
etc. This helps create interactive, styled graphs in Jupyter notebooks and PyCharm.
Plotly was introduced in 2012 by Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of those companies that are making good use of Plotly.
Apart from the above mentioned big data technologies, there are several other emerging big data
technologies. The following are some essential technologies among them:
HADOOP
Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale up
from single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers namely −
Apache Spark
Originally developed by Matei Zaharia in the AMPLab at UC Berkeley, Apache Spark is an open
source Hadoop processing engine that is an alternative to Hadoop MapReduce. Spark uses in-
memory primitives that can improve performance by up to 100X over MapReduce for certain
applications. It is well-suited to machine learning algorithms and interactive analytics. Spark
consists of multiple components: Spark Core and Resilient Distributed Datasets (RDDs), Spark
SQL, Spark Streaming, MLlib Machine Learning Library and GraphX. Spark is a top-level Apache
project.
Apache Storm
Written primarily in the Clojure programming language, Apache Storm is another distributed
computation framework alternative to MapReduce geared to real-time processing of streaming
data. It is well suited to real-time data integration and applications involving streaming analytics
and event log monitoring. It was originally created by Nathan Marz and his team at BackType,
before it was acquired by Twitter and released to open source. Storm applications are designed as a
“topology” that acts as a data transformation pipeline. Storm is a top-level Apache project.
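A Storm-style topology, a pipeline from a spout through bolts, can be mimicked with Python generators; the events and component names are invented for illustration:

```python
# A toy version of a Storm-style topology: a "spout" emits tuples and
# "bolts" transform them, chained into a data transformation pipeline.
def spout():
    """Source of the stream, e.g. log events arriving in real time."""
    for line in ["ERROR disk full", "INFO ok", "ERROR timeout"]:
        yield line

def filter_bolt(stream):
    """Intermediate bolt: keep only error events."""
    return (line for line in stream if line.startswith("ERROR"))

def count_bolt(stream):
    """Terminal bolt: aggregate the filtered stream."""
    return sum(1 for _ in stream)

errors = count_bolt(filter_bolt(spout()))
print(errors)
```

In Storm, each spout and bolt runs as parallel tasks on a cluster and tuples flow between them over the network; the chained-generator shape above is the single-process analogue of that pipeline.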
Apache Ranger
Apache Ranger is a framework for enabling, monitoring and managing comprehensive data
security across the Hadoop platform. Based on technology from big data security specialist XA
Secure, Apache Ranger was made an Apache Incubator project after Hadoop distribution vendor
Hortonworks acquired that company. Ranger offers a centralized security framework to manage
fine-grained access control over Hadoop and related components (like Apache Hive, HBase, etc.).
It also can enable audit tracking and policy analytics
Apache Knox Gateway is a REST API Gateway that provides a single secure access point for all
REST interactions with Hadoop clusters. In that way, it helps in the control, integration,
monitoring and automation of critical administrative and analytical needs of the enterprise. It also
complements Kerberos secured Hadoop clusters. Knox is an Apache Incubator project.
Apache NiFi
Born from a National Security Agency (NSA) project, Apache NiFi is a top-level Apache project
for orchestrating data flows from disparate data sources. It aggregates data from sensors, machines,
geo location devices, clickstream files and social feeds via a secure, lightweight agent. It also
mediates secure point-to-point and bidirectional data flows and allows the parsing, filtering,
joining, transforming, forking or cloning of data streams. Nifi is designed to integrate with Kafka
as the building blocks of real-time predictive analytics applications leveraging the Internet of
Things.
Apache Hadoop
Apache Hadoop is an open source software framework for data-intensive distributed applications
originally created by Doug Cutting to support his work on Nutch, an open source Web search
engine. To meet Nutch’s multimachine processing requirements, Cutting implemented a
MapReduce facility and a distributed file system that together became Hadoop. He named it after
his son’s toy elephant. Through MapReduce, Hadoop distributes Big Data in pieces over a series of
nodes running on commodity hardware. Hadoop is now among the most popular technologies for
storing the structured, semi-structured and unstructured data that comprise Big Data. Hadoop is
available under the Apache License 2.0.
R
R is an open source programming language and software environment designed for statistical
computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand beginning in 1993 and is rapidly becoming the go-to tool for
statistical analysis of very large data sets. It has been commercialized by a company called
Revolution Analytics, which is pursuing a services and support model inspired by Red Hat’s
support for Linux. R is available under the GNU General Public License.
Cascading
An open source software abstraction layer for Hadoop, Cascading allows users to create and
execute data processing workflows on Hadoop clusters using any JVM-based language. It is
intended to hide the underlying complexity of MapReduce jobs. Cascading was designed by Chris
Wensel as an alternative API to MapReduce. It is often used for ad targeting, log file analysis,
bioinformatics, machine learning, predictive analytics, Web content mining and ETL applications.
Commercial support for Cascading is offered by Concurrent, a company founded by Wensel after
he developed Cascading. Enterprises that use Cascading include Twitter and Etsy. Cascading is
available under the Apache License.
Scribe
Scribe is a server developed by Facebook and released in 2008. It is intended for aggregating log
data streamed in real time from a large number of servers. Facebook designed it to meet its own
scaling challenges, and it now uses Scribe to handle tens of billions of messages a day. It is
available under the Apache License 2.0.
ElasticSearch
Developed by Shay Banon and based upon Apache Lucene, ElasticSearch is a distributed,
RESTful open source search server. It’s a scalable solution that supports near real-time search and
multitenancy without a special configuration. It has been adopted by a number of companies,
including StumbleUpon and Mozilla. ElasticSearch is available under the Apache License 2.0.
Apache HBase
Written in Java and modeled after Google’s BigTable, Apache HBase is an open source, non-
relational columnar distributed database designed to run on top of Hadoop Distributed Filesystem
(HDFS). It provides fault-tolerant storage and quick access to large quantities of sparse data.
HBase is one of a multitude of NoSQL data stores that have become available in the past several
years. In 2010, Facebook adopted HBase to serve its messaging platform. It is available under the
Apache License 2.0.
Apache Cassandra
Another NoSQL data store, Apache Cassandra is an open source distributed database management
system developed by Facebook to power its Inbox Search feature. Facebook abandoned Cassandra
in favor of HBase in 2010, but Cassandra is still used by a number of companies, including Netflix,
which uses Cassandra as the back-end database for its streaming services. Cassandra is available
under the Apache License 2.0.
MongoDB
Created by the founders of DoubleClick, MongoDB is another popular open source NoSQL data
store. It stores data in JSON-like documents with dynamic schemas, serialized in a binary
format called BSON (Binary JSON). MongoDB has been adopted by a number of large enterprises,
including MTV
Networks, craigslist, Disney Interactive Media Group, The New York Times and Etsy. It is
available under the GNU Affero General Public License, with language drivers available under an
Apache License. The company 10gen offers commercial MongoDB licenses.
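MongoDB's "dynamic schemas" mean that documents in the same collection need not share the same fields. The following Python sketch illustrates the idea with plain dicts and the standard json module rather than a live MongoDB connection; the collection and field names are invented for illustration.

```python
import json

# Two documents destined for the same hypothetical "products" collection;
# no schema is enforced, so their fields can differ freely.
book = {"_id": 1, "type": "book", "title": "Moby-Dick", "pages": 635}
album = {"_id": 2, "type": "album", "title": "Kind of Blue", "tracks": 5}

collection = [book, album]

# MongoDB stores such documents as BSON; JSON shows the same shape in text form.
serialized = [json.dumps(doc) for doc in collection]
roundtrip = [json.loads(s) for s in serialized]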
Apache CouchDB
Apache CouchDB is yet another open source NoSQL database. It uses JSON to store data,
JavaScript as its query language and MapReduce and HTTP for an API. CouchDB was created in
2005 by former IBM Lotus Notes developer Damien Katz as a storage system for a large scale
object database. The BBC uses CouchDB for its dynamic content platforms, while Credit Suisse’s
commodities department uses it to store configuration details for its Python market data
framework. CouchDB is available under the Apache License 2.0.
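CouchDB queries are defined as MapReduce "views," normally written in JavaScript. The mechanics can be sketched in Python; the document fields and values below are invented for illustration, and the grouping step stands in for what the CouchDB server does internally.

```python
# Hypothetical JSON documents as CouchDB might store them.
docs = [
    {"_id": "a", "dept": "commodities", "amount": 10},
    {"_id": "b", "dept": "commodities", "amount": 5},
    {"_id": "c", "dept": "equities", "amount": 7},
]

def map_fn(doc):
    # A view's map function emits (key, value) pairs for each document.
    yield doc["dept"], doc["amount"]

def reduce_fn(values):
    # A view's reduce function folds the values emitted under one key.
    return sum(values)

# Group the emitted pairs by key, then reduce each group.
groups = {}
for doc in docs:
    for key, value in map_fn(doc):
        groups.setdefault(key, []).append(value)
view_result = {key: reduce_fn(vals) for key, vals in groups.items()}
```

The result maps each key (here, department) to its reduced value, which is how a CouchDB view returns grouped aggregates over JSON documents.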
Cloud Computing
Simply described, cloud computing is the delivery of computing services over the internet
(often known as "the cloud") to enable faster innovation, flexible resources, and economies of
scale. These services include servers, storage, databases, networking, software, analytics, and
intelligence. Usually, you will only be charged for the cloud services that you actually use. This
can help you lower your operational costs, make your infrastructure run more efficiently, and
scale up or down as your business's needs change.
Services Provided by Cloud Computing
Cloud computing represents a significant paradigm change from the conventional approach that
firms use to think about their information technology resources. Here are six of the most
common reasons why companies are turning to cloud computing services −
Performance − Popular cloud computing services are backed by a worldwide
infrastructure of reliable data centres updated with the latest hardware to maximise speed
and efficiency. This reduces network latency for applications and increases economies of
scale compared to a single datacentre.
Speed − Most cloud services are self-service and on-demand. Even enormous computing
resources can be deployed in minutes with a few mouse clicks, giving companies flexibility
and reducing their capacity-planning stress.
Cost − Cloud computing removes the need for upfront investments in hardware and
software, as well as the cost of constructing and maintaining on-premises data centres,
complete with racks of servers, round-the-clock electricity for power and cooling, and
information technology specialists to manage the infrastructure. These on-premises costs can
add up quickly.
Global Scale − Using cloud services allows for elastic growth. In cloud computing, this
means supplying the right amount of computing power, storage space, and bandwidth as
needed from the right geographic location.
Security − Numerous cloud service providers make available a comprehensive collection
of security policies, technologies, and controls that work together to improve your
organization's overall security posture. This, in turn, helps protect your data, apps, and
infrastructure from any potential dangers that may arise.
Reliability − Because data can be duplicated at numerous redundant locations on the
cloud provider's network, cloud computing makes data backup, disaster recovery, and
business continuity simpler and less costly.
Difference between Big Data and Cloud Computing
The following highlights a major difference between Big Data and Cloud Computing −
Function − Big Data delivers reduced costs and times, increased data storage capacity,
inventive product creation, and effective decision-making. Cloud Computing offers
opportunities for innovation, scalable economies, and adaptable resources, and it allows
the infrastructure to be operated in a more effective and efficient manner.
The terms "big data" and "cloud computing" are often used interchangeably, yet they serve
distinct purposes. Both are necessary components in transmitting, processing, and
transferring data, and they help ensure the transfer is successful and efficient. The
integration and virtualization of resources is what makes Cloud Computing a useful tool for
Big Data.
MOBILE BUSINESS INTELLIGENCE
Mobile business intelligence is the transfer of business intelligence from the desktop to mobile
devices such as the BlackBerry, iPad, and iPhone.
Mobile business intelligence refers to the ability to access analytics and data on mobile
devices or tablets rather than desktop computers. It presents business metric dashboards and
key performance indicators (KPIs) more clearly.
As the use of mobile devices has risen, so has the technology we all use in our daily lives
to make them easier, including in business. Many businesses have benefited from mobile
business intelligence. Essentially, this section is a guide for business owners and others on
the benefits and pitfalls of Mobile BI.
Mobile phones' data storage capacity has grown in tandem with their use, and the number of
businesses taking advantage of this is growing by the day.
Whether you want to expand your business or boost its productivity, mobile BI can help, and
it works for both small and large businesses, whether you are a salesperson or a CEO. Mobile
BI is in high demand because it shortens the time needed to obtain information, freeing that
time for quick decision making.
As a result, timely decision-making can boost customer satisfaction and improve an enterprise's
reputation among its customers. It also aids in making quick decisions in the face of emerging
risks.
Advantages of mobile BI
1. Simple access
Mobile BI is not restricted to a single device or location. You can view your data at any
time and from any place. Real-time visibility into a firm improves productivity and the
daily efficiency of the business, and obtaining a company-wide view with a single click
simplifies the process.
2. Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to stay
ahead of the competition. Easy access to real-time data improves company opportunities
and raises sales and capital. This also aids in making the necessary decisions as market
conditions change.
3. Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from any
location, delivering information on demand. This helps users obtain what they require when
they need it, so decisions are made quickly.
4. Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data
when they need it. Obtaining all of the corporate data with a single click frees up a
significant amount of time to focus on the smooth and efficient operation of the firm.
Increased productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1. Stack of data
The primary function of Mobile BI is to store data in a systematic manner and then
present it to the user as required. As a result, Mobile BI stores all of the information and
ends up with heaps of historical data. A corporation may need only a small portion of that
data, but it must store all of it, which piles up in the stack.
2. Expensive
Mobile BI can be quite costly at times. Large corporations can afford its expensive
services, but small businesses often cannot. Beyond the cost of the Mobile BI software
itself, we must also consider the rates of the IT workers needed for its smooth operation,
as well as the hardware costs involved.
Moreover, larger corporations do not settle for just one Mobile BI provider; they require
several. Even for basic commercial transactions, mobile BI is costly.
3. Time consuming
Businesses prefer Mobile BI because it promises quick results, and companies are rarely
patient enough to wait long for data before acting on it. In today's fast-paced environment,
anything that produces results quickly is valuable. However, because the system is built on
data from the warehouse, implementing BI in an enterprise can take more than 18 months.
4. Data breach
The user's biggest concern when entrusting data to Mobile BI is data leakage. If you
handle sensitive data through Mobile BI, a single error can destroy your data or make it
public, which can be detrimental to your business.
Many Mobile BI providers are working to make their platforms fully secure to protect their
users' data. Security is not only something Mobile BI providers must consider; we, as
users, must also consider it when granting data access authorization.
Because we work online in every aspect, a lot of data accumulates in Mobile BI, which
can be a significant problem: a large portion of the data analysed by Mobile BI is
irrelevant or completely useless, and this can slow down the entire procedure.
This requires you to select the data that is important and may be required in the future.
Best Mobile BI tools
1. Sisense
Sisense is a flexible business intelligence (BI) solution that includes powerful analytics,
visualisation, and reporting capabilities for managing and supporting corporate data.
Businesses can use the solution to evaluate large, diverse databases and generate relevant
business insights. You can easily view enormous volumes of complex data with Sisense's
code-first, low-code, and even no-code technologies. Sisense was established in 2004 and is
headquartered in New York.
Since then, the team has steadily advanced its research; once the company received $4
million in funding from investors, it accelerated its development.
2. SAP Roambi Analytics
Roambi Analytics is a BI tool that lets you fundamentally rethink your data analysis,
making it easier and faster while also increasing your interaction with the data.
You can consolidate all of your company's data in a single tool using SAP Roambi
Analytics, which integrates all ongoing systems and data. Use of SAP Roambi analysis is a
simple three-step technique. Upload your HTML or spreadsheet files first. The information
is then transformed into informative graphs and data that can be visualised.
After the data is collected, you may easily share it with your preferred device. Roambi
Analytics was founded in 2008 by a team based in California.
3. Microsoft Power BI
Microsoft Power BI is an easy-to-use tool for non-technical business owners who are
unfamiliar with BI tools but wish to aggregate, analyse, visualise, and share data. You
only need a basic understanding of Excel and other Microsoft tools, and if you are familiar
with these, Power BI can be used as a self-service tool.
Microsoft Power BI has a unique feature that allows users to create subsets of data and
then automatically apply analytics to that information.
That way, the business owner will know where they stand in comparison to their
competitors and where they can grow in the future. It combines reporting, modelling,
analysis, and dashboards to help you understand your organization's data and make sound
business decisions.
4. Amazon QuickSight
Amazon QuickSight assists in creating and distributing interactive BI dashboards to users,
and in retrieving answers to natural-language queries in seconds. QuickSight can be accessed
from any device and embedded in any website, portal, or app.
Amazon QuickSight allows you to quickly and easily create interactive dashboards and
reports for your users. Anyone in your organisation can securely access those dashboards
via browsers or mobile devices.
QuickSight's eye-catching feature is its pay-per-session model, which allows users to view
dashboards created by others without paying much. The user pays according to the length of
the session, with prices ranging from $0.30 for a 30-minute session up to a cap of $5 per
user per month for unlimited use.
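Using the figures quoted above ($0.30 per 30-minute reader session, capped at $5 per user per month), the billing rule is simply the session charge bounded by the monthly cap; a quick Python sanity check of that arithmetic:

```python
SESSION_PRICE = 0.30   # dollars per 30-minute reader session (figure from the text)
MONTHLY_CAP = 5.00     # maximum charge per user per month (figure from the text)

def monthly_cost(sessions):
    """Pay-per-session billing with a per-user monthly cap."""
    return min(sessions * SESSION_PRICE, MONTHLY_CAP)

light_user = monthly_cost(4)    # 4 sessions -> $1.20
heavy_user = monthly_cost(40)   # 40 sessions -> capped at $5.00
```

Under this rule, an occasional viewer pays only for the sessions used, while a heavy user never pays more than the cap, which is why the model suits organisations with many infrequent dashboard readers.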
CROWD SOURCING ANALYTICS