UNIT-1
Data:
1. Composition: The composition of data deals with the structure of data, i.e.,
the sources of data, the granularity, and the type and nature of the data as to
whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, i.e., “Can one
use this data as is for analysis?” or “Does it require cleaning for further
enhancement and enrichment?”
3. Context: The context of data deals with “Where has this data been
generated?”, “Why was this data generated?”, “How sensitive is this
data?”, and “What are the events associated with this data?”
2. Semi-Structured Data:
This is data which does not conform to a data model but has some structure.
Metadata for this data is available but is not sufficient.
Sources: XML, JSON, E-mail
Characteristics:
- inconsistent structure
- self-describing (label/value pairs)
- schema information is blended with data values
- data objects may have different attributes that are not known beforehand
Challenges:
Storage cost: Storing data with their schemas increases cost
RDBMS: Semi-structured data cannot be stored in existing RDBMS as
data cannot be mapped into tables directly
Irregular and partial structure: Some data elements may have extra
information while others none at all
Implicit structure: In many cases the structure is implicit.
Interpreting relationships and correlations is very difficult
Flat files: Semi-structured data is usually stored in flat files, which are
difficult to index and search
Heterogeneous sources: Data comes from varied sources, which makes it difficult
to tag and search.
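The self-describing, label/value nature of semi-structured data, and the fact that records may carry different attributes, can be sketched with JSON (the record contents here are made up for illustration):

```python
import json

# Two semi-structured records: each is self-describing (label/value pairs),
# and the second carries an attribute the first one lacks.
records = [
    '{"name": "Alice", "email": "alice@example.com"}',
    '{"name": "Bob", "email": "bob@example.com", "phone": "555-0100"}',
]

parsed = [json.loads(r) for r in records]

# The schema is blended with the data values: the only way to learn the
# full set of attributes is to scan the records themselves.
all_keys = set()
for rec in parsed:
    all_keys.update(rec.keys())

print(sorted(all_keys))
```

Note that no schema exists ahead of time; the attribute set is discovered only by reading the data, which is exactly why mapping such records into fixed RDBMS tables is hard.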
3. Unstructured Data:
This is the data which does not conform to a data model or is not in a
form which can be used easily by a computer program.
About 80–90% data of an organization is in this format.
Sources: memos, chat-rooms, PowerPoint presentations, images,
videos, letters, researches, white papers, body of an email, etc.
Characteristics:
Does not conform to any data model
Cannot be stored in the form of rows and columns
Not in any particular format or sequence
Not easily usable by a computer program
Does not follow any rules or semantics
Challenges:
Storage space: The sheer volume of unstructured data and its
unprecedented growth make it difficult to store. Audio, video,
images, etc. consume huge amounts of storage space
Scalability: Scalability becomes an issue with increase in
unstructured data
Retrieve information: Retrieving and recovering unstructured data are
cumbersome
Security: Ensuring security is difficult due to varied sources of data
(e.g. e-mail, web pages)
Update/delete: Updating, deleting, etc. are not easy due to the
unstructured form
Indexing and Searching: Indexing becomes difficult with increase in
data.
Searching is difficult for non-text data
Interpretation: Unstructured data is not easily interpreted by
conventional search algorithms
Tags: As the data grows, it is not possible to tag it manually
Indexing: Designing algorithms to understand the meaning
of the document and then tag or index it accordingly is difficult.
Big Data:
To get a handle on the challenges of big data, you first need to know what the term "Big Data" means.
When we hear "Big Data," we might wonder how it differs from the more common "data." The term "data"
refers to any unprocessed character or symbol that can be recorded on media or transmitted via
electronic signals by a computer. Raw data, however, is useless until it is processed in some way.
Google File System (GFS):
GFS is a file system designed to handle batch workloads with lots of data. The system is
distributed: multiple machines store copies of every file, and multiple machines try to
read/write the same file. GFS was originally designed for Google’s use case of searching and
indexing the web. Files are divided into fixed-size chunks and distributed across the chunk
servers, or slave machines.
Components of GFS
A GFS cluster is simply a group of connected computers; there could be
hundreds or even thousands of computers in each cluster. Any GFS cluster
includes three basic entities:
GFS Clients: They can be computer programs or applications which may be used to request files.
Requests may be made to access and modify already-existing files or add new files to the system.
GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the cluster’s actions in
an operation log. Additionally, it keeps track of metadata, the data that describes chunks. The metadata
tells the master server which files the chunks belong to and where they fit within the overall file.
GFS Chunk Servers: They are the GFS’s workhorses. They store file chunks of 64 MB in size. The chunk
servers do not send chunks to the master server. Instead, they deliver the requested chunks directly to
the client. To ensure reliability, the GFS makes numerous copies of each chunk (three by default) and
stores them on different chunk servers; each copy is called a replica.
Overall Design
The system consists of a single master machine and many chunkservers. Appropriately named, these
chunkservers store parts of files called chunks (64MB data blocks), which are identified by a 64-bit
chunk handle.
The following diagram shows how the client interacts with the master and other chunkservers.
The 4 steps illustrate a simple read operation. In step 1, the client asks the master for metadata:
which replicas hold my data, and where are they located? The metadata is returned in step 2; in
step 3, the client asks one of the replicas (usually the closest one) for the actual data, which is
transferred in step 4.
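The four-step read path can be sketched in code. This is a hypothetical illustration, not the real GFS API; the class and method names (Master, ChunkServer, lookup, read_chunk) are assumptions made for the sketch:

```python
# Hypothetical sketch of the 4-step GFS read path; not the real GFS API.
CHUNK_SIZE = 64 * 1024 * 1024  # chunks are 64 MB

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}               # chunk handle -> chunk data

    def read_chunk(self, handle):      # step 4: deliver the data directly
        return self.chunks[handle]

class Master:
    def __init__(self):
        self.file_to_chunks = {}       # filename -> ordered list of chunk handles
        self.chunk_locations = {}      # chunk handle -> replica servers

    def lookup(self, filename, offset):
        # steps 1-2: the master returns metadata only, never file data
        handle = self.file_to_chunks[filename][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]

def client_read(master, filename, offset):
    handle, replicas = master.lookup(filename, offset)  # steps 1-2
    closest = replicas[0]              # step 3: pick one replica (e.g. the closest)
    return closest.read_chunk(handle)  # step 4: data comes from the chunkserver
```

Note how the master hands out only (handle, locations) pairs, so the heavy data transfer bypasses it entirely, which is the bandwidth point made below.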
Every client has to interact with the master. If the master started forwarding data to clients, bandwidth
would be used up quickly, and the system couldn’t serve many requests. By returning metadata,
interactions with the master are quick, while interactions with chunkservers are longer (which is fine,
since clients are probably talking to different chunkservers).
Replica placement has two objectives: to maximize reliability/availability and to maximize network
utilization. For reliability, data is replicated not only across machines, but also across server racks.
Other criteria such as
equal disk space utilization try to ensure that no machine/rack is overloaded with requests. And when
these metrics get too skewed (e.g. a server rack is much more utilized than other racks), the master will
perform rebalancing, which includes reshuffling chunks across machines and racks.
Advantages of GFS
1. High availability: data is still accessible even if a few nodes fail, thanks to replication.
Component failures are the norm rather than the exception, as the saying goes.
2. High throughput: many nodes operate concurrently.
3. Reliable storage: data that has been corrupted can be detected and re-replicated from a healthy copy.
Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from
files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s
memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by
logging mutations to an operation log stored on the master’s local disk and replicated on remote
machines. Using a log allows us to update the master state simply, reliably, and without risking
inconsistencies in the event of a master crash. The master does not store chunk location information
persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a
chunkserver joins the cluster.
Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than
typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver
and is extended only as needed. Lazy space allocation avoids wasting space due to internal
fragmentation, perhaps the greatest objection against such a large chunk size.
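Because chunks are a fixed 64 MB, a client can compute which chunk holds a given byte offset with simple arithmetic. A minimal sketch (the function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size chosen by GFS

def locate(byte_offset):
    """Map a byte offset within a file to (chunk index, offset inside that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# A read at byte 200,000,000 falls in the third chunk (index 2):
print(locate(200_000_000))  # (2, 65782272)
```

The client sends the filename and the chunk index to the master, which replies with the chunk handle and replica locations.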
Big Data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.
Characteristics (V’s):
1. Volume: Volume refers to the amount of data. Data sizes have grown from
bits to yottabytes:
Bits-> Bytes-> KBs-> MBs-> GBs-> TBs-> PBs-> Exabytes-> Zettabytes->
Yottabytes
Data comes from many different sources, such as DOC and PDF files, YouTube, a chat
conversation on an internet messenger, a customer feedback form on an online retail
website, CCTV footage, and weather forecasts.
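The unit ladder above can be walked in code. A small sketch, assuming the binary convention of 1024 bytes per step:

```python
# Each step of the ladder is a factor of 1024 (binary convention assumed)
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Express a byte count in the largest sensible unit on the ladder."""
    value, idx = float(num_bytes), 0
    while value >= 1024 and idx < len(UNITS) - 1:
        value /= 1024
        idx += 1
    return f"{value:.1f} {UNITS[idx]}"

print(human_readable(3_500_000_000))  # 3.3 GB
```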
2. Variety: Variety deals with the wide range of data types and sources of data:
structured, semi-structured, and unstructured.
Structured data: from traditional transaction processing systems, RDBMS, etc.
Semi-structured data: for example, Hypertext Markup Language (HTML) and
eXtensible Markup Language (XML).
Unstructured data: for example, unstructured text documents, audio, video,
emails, photos, PDFs, social media, etc.
3. Velocity: Velocity refers to the speed at which data is generated and processed.
We have moved from the days of batch processing to real-time processing.
4. Veracity: Veracity refers to biases, noise, and abnormality in data. The key
question is: “Is all the data that is being stored, mined, and analysed meaningful
and pertinent to the problem under consideration?”
5. Value: This refers to the value that big data can provide, and it relates directly
to what organizations can do with the collected data. It is often quantified as the
potential social or economic value that the data might create.
Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially when the data is
in different formats) within legacy systems. Unstructured data cannot be stored in traditional databases.
Processing
Processing big data refers to the reading, transforming, extraction, and formatting of useful information
from raw information. The input and output of information in unified formats continue to present
difficulties.
Security
Security is a big concern for organizations. Non-encrypted information is at risk of theft or damage by
cyber-criminals. Therefore, data security professionals must balance access to data against maintaining
strict security protocols.
Many organizations are dealing with challenges related to scaling and poor data quality, but
solutions are available. Database sharding, memory caching, moving to the cloud, and separating
read-only and write-active databases are all effective scaling methods. While each of these
approaches is effective on its own, combining them will take you to the next level.
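One of these methods, database sharding, can be sketched as routing each key to one of several databases by a stable hash. The shard names and hashing scheme here are illustrative assumptions, not any particular product's API:

```python
import hashlib

# Four hypothetical database shards; real deployments would hold connections here
SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(key: str) -> str:
    # A stable hash guarantees the same key always routes to the same shard,
    # so reads find the data that writes placed there
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # always the same shard for this key
```

Spreading keys this way lets each database hold only a fraction of the total data, which is the scaling benefit the text describes.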
These processes use familiar statistical analysis techniques—like clustering and regression—and apply
them to more extensive datasets with the help of newer tools.
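As an illustration of one such familiar technique, here is least-squares linear regression on a tiny made-up dataset, in pure Python:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: return (slope, intercept) for y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy dataset: y grows roughly 2 units per unit of x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 1.99 0.09
```

The same formula scales to extensive datasets; newer tools mainly change where the sums are computed (distributed across machines), not the statistics themselves.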
Big data has been a buzzword since the early 2000s, when software and hardware capabilities made it
possible for organizations to handle large amounts of unstructured data.
Since then, new technologies—from Amazon to smartphones—have contributed even more to the
substantial amounts of data available to organizations. With the explosion of data, early innovation
projects like Hadoop, Spark, and NoSQL databases were created for the storage and processing of big
data.
This field continues to evolve as data engineers look for ways to integrate the vast amounts of complex
information created by sensors, networks, transactions, smart devices, web usage, and more. Even now,
big data analytics methods are being used with emerging technologies, like machine learning, to
discover and scale more complex insights.
1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations can
gather both structured and unstructured data from a variety of sources — from cloud storage to mobile
applications to in-store IoT sensors and beyond. Some data will be stored in data warehouses where
business intelligence tools and solutions can access it easily. Raw or unstructured data that is too diverse
or complex for a warehouse may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on analytical
queries, especially when it’s large and unstructured. Available data is growing exponentially, making
data processing a challenge for organizations.
One processing option is batch processing, which looks at large data blocks over time. Batch
processing is useful when there is a longer turnaround time between collecting and analyzing data.
Stream processing looks at small batches of data at once, shortening the delay time between collection
and analysis for quicker decision-making. Stream processing is more complex and often more expensive.
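The batch-versus-stream distinction can be sketched as follows; the running-average task is a made-up example:

```python
def batch_average(records):
    # Batch: process one large block after all of it has been collected
    return sum(records) / len(records)

def stream_average(record_iter):
    # Stream: maintain a running result as each record arrives,
    # yielding an up-to-date answer after every record
    total, count = 0, 0
    for value in record_iter:
        total, count = total + value, count + 1
        yield total / count

data = [10, 20, 30, 40]
print(batch_average(data))          # one answer at the end: 25.0
print(list(stream_average(data)))   # answers along the way: [10.0, 15.0, 20.0, 25.0]
```

Batch gives one answer after the full turnaround; the stream version trades extra bookkeeping for a usable answer at every step, mirroring the cost/latency trade-off described above.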
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all data must be
formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for. Dirty
data can obscure and mislead, creating flawed insights.
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can turn big
data into big insights. Some of these big data analysis methods include:
Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the future,
identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
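As a toy illustration of the anomaly-identification step of data mining, one simple approach flags points that lie far from the mean; the z-score threshold of 2 and the sensor-reading data are assumptions for the sketch:

```python
import statistics

def anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 95]  # one obviously bad reading
print(anomalies(readings))  # [95]
```

Real data-mining pipelines use more robust methods (clustering, isolation forests), but the idea is the same: separate the points that do not fit the pattern.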
Q) Give Real-time applications of Big Data Analytics.
1. Banking and Securities:
This industry relies heavily on Big Data for risk analytics, including
anti-money laundering, enterprise risk management, "Know Your
Customer" checks, and fraud mitigation.
The Securities and Exchange Commission (SEC) is using Big Data to monitor
financial market activity. They are currently using network analytics and
natural language processors to catch illegal trading activity in the financial
markets.
3. Healthcare Sector:
Some hospitals, like Beth Israel, are using data collected from a cell phone
app from millions of patients to allow doctors to use evidence-based
medicine, as opposed to administering several medical/lab tests to every
patient who goes to the hospital.
Free public health data and Google Maps have been used by the University of
Florida to create visual data that allows for faster identification and efficient
analysis of healthcare information, used in tracking the spread of chronic
disease.
4. Education:
The University of Tasmania, an Australian university, has
deployed a Learning and Management System that tracks, among other
things, when a student logs onto the system, how much time is spent on
different pages in the system, and the overall progress of a student
over time.
On a governmental level, the Office of Educational Technology in the
U.S. Department of Education is using Big Data to develop analytics that
help course-correct students who are going astray while using
online Big Data certification courses. Click patterns are also being used to
detect boredom.
5. Government:
In public services, Big Data has an extensive range of applications,
including energy exploration, financial market analysis, fraud detection,
health-related research, and environmental protection.
The Food and Drug Administration (FDA) is using Big Data to detect and
study patterns of food-related illnesses and diseases.
6. Insurance Industry:
Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analyzing and predicting customer behavior
through data derived from social media, GPS-enabled devices, and CCTV
footage. The Big Data also allows for better customer retention from insurance
companies.
7. Transportation Industry:
Some applications of Big Data by governments, private organizations, and
individuals include:
Government use of Big Data: traffic control, route planning,
intelligent transport systems, congestion management (by predicting traffic
conditions)
Private-sector use of Big Data in transport: revenue management,
technological enhancements, logistics, and competitive advantage (by
consolidating shipments and optimizing freight movement)