Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems


DISTRIBUTION, DATA, DEPLOYMENT:
SOFTWARE ARCHITECTURE CONVERGENCE IN BIG DATA SYSTEMS

INTRODUCTION:
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data. In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
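To make this data model concrete, the following is a minimal sketch, in C# to match this project's coding language, of a Bigtable-style sparse map from (row key, column key, timestamp) to an uninterpreted string value. The class and method names are illustrative assumptions, not Google's actual client API.

using System;
using System.Collections.Generic;

// Illustrative sketch of the Bigtable data model: a sparse map from
// (row key, column key, timestamp) to an uninterpreted string value.
class BigtableSketch
{
    private readonly Dictionary<Tuple<string, string, long>, string> cells =
        new Dictionary<Tuple<string, string, long>, string>();

    // Row and column names are arbitrary strings; values are uninterpreted.
    public void Write(string rowKey, string columnKey, string value)
    {
        long timestamp = DateTime.UtcNow.Ticks; // each write is a new version
        cells[Tuple.Create(rowKey, columnKey, timestamp)] = value;
    }

    // Returns the most recent version of the cell, or null if absent.
    public string Read(string rowKey, string columnKey)
    {
        string latest = null;
        long latestTs = long.MinValue;
        foreach (KeyValuePair<Tuple<string, string, long>, string> kv in cells)
        {
            if (kv.Key.Item1 == rowKey && kv.Key.Item2 == columnKey &&
                kv.Key.Item3 > latestTs)
            {
                latestTs = kv.Key.Item3;
                latest = kv.Value;
            }
        }
        return latest;
    }
}

Because clients control the row keys, they also control locality: for example, keying records by hospital ID plus patient ID would keep one hospital's records adjacent in the underlying storage.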

SCOPE OF THE PROJECT

The scope of the project covers:
- The state of the art in scalable data management for traditional and cloud computing infrastructures, for both update-heavy and analytical workloads.
- A summary of current research projects and future research directions.
- The design choices that have led to the success of scalable systems, and the errors that limited the success of some other systems.
- The design principles that should be carried over in designing the next generation of data management systems for the cloud.
- Understanding the design space for DBMSs targeted at supporting update-intensive workloads, for both large single-tenant and large multi-tenant systems.

LITERATURE SURVEY
1. Bigtable: A Distributed Storage System for Structured Data
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat
Year: 2007
Description:
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable. Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.

2. Above the Clouds: A View of Cloud Computing


Authors: Michael Armbrust, Armando Fox, Rean Griffith
Year: 2008
Description:
Cloud Computing, the long-held dream of computing as a utility, has the potential
to transform a large part of the IT industry, making software even more attractive
as a service and shaping the way IT hardware is designed and purchased.
Developers with innovative ideas for new Internet services no longer require the
large capital outlays in hardware to deploy their service or the human expense to
operate it. They need not be concerned about over-provisioning for a service
whose popularity does not meet their predictions, thus wasting costly resources, or
under-provisioning for one that becomes wildly popular, thus missing potential
customers and revenue. Moreover, companies with large batch-oriented tasks can
get results as quickly as their programs can scale, since using 1000 servers for one
hour costs no more than using one server for 1000 hours. This elasticity of
resources, without paying a premium for large scale, is unprecedented in the
history of IT. As a result, Cloud Computing is a popular topic for blogging
and white papers and has been featured in the titles of workshops, conferences,
and even magazines.

3. Consistency Tradeoffs in Modern Distributed Database System Design
Authors: Daniel J. Abadi
Year: 2011
Description:
To understand modern DDBS design, it is important to realize the context
in which these systems were built. Amazon originally designed Dynamo
to serve data to the core services in its e-commerce platform (for example,
the shopping cart). Facebook constructed Cassandra to power its Inbox
Search feature. LinkedIn created Voldemort to handle online updates from
various write-intensive features on its website. Yahoo built PNUTS to
store user data that can be read or written on every webpage view, to
store listings data for Yahoo's shopping pages, and to store data to serve
its social networking applications. Use cases similar to Amazon's
motivated Riak. In each case, the system typically serves data for
webpages constructed on the fly and shipped to an active website user, and
receives online updates. Studies indicate that latency is a critical factor in
online interactions: an increase as small as 100 ms can dramatically reduce
the probability that a customer will continue to interact or return in the
future.

4. F1: A Distributed SQL Database That Scales


Authors: Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy
Year: 2013
Description:
F1 is a distributed relational database system built at Google to
support the AdWords business. F1 is a hybrid database that combines
high availability, the scalability of NoSQL systems like Bigtable, and the
consistency and usability of traditional SQL databases. F1 is built on
Spanner, which provides synchronous cross-datacenter replication and
strong consistency. Synchronous replication implies higher commit
latency, but we mitigate that latency by using a hierarchical schema
model with structured data types and through smart application design.
F1 also includes a fully functional distributed SQL query engine and
automatic change tracking and publishing. F1 is a fault-tolerant,
globally distributed OLTP and OLAP database built at Google as the
new storage system for Google's AdWords system. It was designed to
replace a sharded MySQL implementation that was not able to meet our
growing scalability and reliability requirements.

5. Big Data and Cloud Computing: Current State and Future Opportunities
Authors: Divyakant Agrawal, Sudipto Das, Amr El Abbadi
Year: 2010
Description:
Scalable database management systems (DBMSs), both for update-intensive
application workloads and for decision support systems for descriptive and deep
analytics, are a critical part of the cloud infrastructure and play an important role
in ensuring the smooth transition of applications from traditional enterprise
infrastructures to next-generation cloud infrastructures. Though scalable data
management has been a vision for more than three decades and much research has
focused on large-scale data management in traditional enterprise settings, cloud
computing brings its own set of novel challenges that must be addressed to ensure
the success of data management solutions in the cloud environment. This tutorial
presents an organized picture of the challenges faced by application developers and
DBMS designers in developing and deploying internet-scale applications. Our
background study encompasses both classes of systems: (i) those supporting
update-heavy applications, and (ii) those for ad-hoc analytics and decision support. We
then focus on providing an in-depth analysis of systems for supporting update-intensive
web applications and provide a survey of the state of the art in this domain. We
crystallize the design choices made by some successful large-scale database
management systems, analyze the application demands and access patterns, and
enumerate the desiderata for a cloud-bound DBMS.

6. Characterizing Cloud Computing Hardware Reliability


Authors: Kashi Venkatesh Vishwanath and Nachiappan Nagappan
Year: 2012
Description:
Modern-day datacenters host hundreds of thousands of servers that coordinate tasks
in order to deliver highly available cloud computing services. These servers
consist of multiple hard disks, memory modules, network cards, processors,
etc., each of which, while carefully engineered, is capable of failing. While the
probability of seeing any such failure in the lifetime (typically 3-5 years in
industry) of a server can be somewhat small, these numbers get magnified
across all devices hosted in a datacenter. At such a large scale, hardware
component failure is the norm rather than the exception. Hardware failure can
lead to a degradation in performance for end users and can result in losses to the
business. A sound understanding of the numbers as well as the causes behind
these failures helps improve operational experience, not only by allowing us to
be better equipped to tolerate failures but also by bringing down the hardware cost
through engineering, directly leading to savings for the company. To the best
of our knowledge, this paper is the first attempt to study server failures and
hardware repairs for large datacenters. We present a detailed analysis of failure
characteristics as well as a preliminary analysis of failure predictors. We hope
that the results presented in this paper will serve as motivation to foster further
research in this area.

7. A Bloat-Aware Design for Big Data Applications

Authors: Yingyi Bu, Vinayak Borkar, Guoqing Xu, Michael J. Carey
Year: 2010
Description:
Over the past decade, the increasing demands of data-driven business intelligence have led to the
proliferation of large-scale, data-intensive applications that often have huge amounts of data (often at
terabyte or petabyte scale) to process. An object-oriented programming language such as Java is
often the developer's choice for implementing such applications, primarily due to its quick
development cycle and rich community resources. While the use of such languages makes
programming easier, significant performance problems can often be seen: the combination of the
inefficiencies inherent in a managed runtime system and the impact of the huge amount of data to be
processed in the limited memory space often leads to memory bloat and performance degradation at
a surprisingly early stage. This paper proposes a bloat-aware design paradigm for the
development of efficient and scalable Big Data applications in object-oriented, GC-enabled
languages. To motivate this work, we first perform a study of the impact of several typical memory
bloat patterns. These patterns are summarized from user complaints on the mailing lists of two
widely used open-source Big Data applications. Next, we discuss our design paradigm to eliminate
bloat. Using examples and real-world experience, we demonstrate that programming under this
paradigm does not incur a significant programming burden. We have implemented a few common data
processing tasks both using this design and using the conventional object-oriented design. Our
experimental results show that this new design paradigm is extremely effective in improving
performance: even for the moderate-size data sets processed, we have observed 2.5x+
performance gains, and the improvement grows substantially with the size of the data set.

PROJECT IMPLEMENTATION
Modules:
ADMIN
Authentication.
View Patient Records.
Preference Setting.
Generate EMR.
Generate MIS Report.
USER
Authentication.
Diagnose Schedule.
EMR Details.
View MIS Report.

Module Description & Diagrams:


1. Authentication
Login:
The user must provide the exact username and password supplied at the
time of registration. If the login succeeds, the user is taken to the main
page; otherwise, the user remains on the login page. A login-check sketch
follows the diagram below.

[Flow diagram: Login → Check status (Database) → Proceed to next hierarchy]
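A minimal sketch of the login check against the back end, assuming a Users table with Username and PasswordHash columns in the SQL Server database (the table, column, and helper names are illustrative assumptions, not the project's actual schema):

using System;
using System.Data.SqlClient;
using System.Security.Cryptography;
using System.Text;

class LoginService
{
    private readonly string connectionString;

    public LoginService(string connectionString)
    {
        this.connectionString = connectionString;
    }

    // Returns true when the username/password pair matches a registered user;
    // the caller then proceeds to the main page, else stays on the login page.
    public bool Authenticate(string username, string password)
    {
        const string sql =
            "SELECT COUNT(*) FROM Users WHERE Username = @u AND PasswordHash = @p";
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@u", username);
            cmd.Parameters.AddWithValue("@p", Hash(password)); // never store plain text
            conn.Open();
            return (int)cmd.ExecuteScalar() > 0;
        }
    }

    private static string Hash(string value)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(value));
            return Convert.ToBase64String(bytes);
        }
    }
}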

View Patient Records


In this module the Admin can view all the patient information. The
records show the patient's treatment request and other information stored
in the BDS in document format.

[Flow diagram: Admin → Login → View patient information → BDS]

Preference Setting
The admin views the patient information and sets a priority based on the
treatment. The priority is derived from the type of treatment mentioned in
the document; a priority-assignment sketch follows the diagram below.

[Flow diagram: Admin → Login → View patient information → Set preference → BDS]
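A minimal sketch of how the priority could be derived from the treatment type named in the patient's document (the treatment names and priority values here are assumptions for illustration):

using System.Collections.Generic;

static class PreferenceSetting
{
    // Lower number = higher priority; the mapping is illustrative.
    private static readonly Dictionary<string, int> PriorityByTreatment =
        new Dictionary<string, int>
        {
            { "emergency", 1 },
            { "surgery",   2 },
            { "checkup",   3 }
        };

    public static int GetPriority(string treatmentType)
    {
        int priority;
        if (PriorityByTreatment.TryGetValue(treatmentType.ToLowerInvariant(), out priority))
            return priority;
        return int.MaxValue; // unknown treatment types are scheduled last
    }
}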

Generate MIS Reports


The Admin maintains MIS records for each patient who takes treatment
in the hospital. The report is sent to the patient when they leave the
hospital and contains billing information.
[Flow diagram: Admin → Login → Maintain patient information → Generate MIS report → BDS]

Generate EMR
The admin collects the patient details and stores them in the BDS. The
EMR contains all the records collected from the initial enquiry through to
in-patient discharge; a storage sketch follows the diagram below.

[Flow diagram: Admin → Login → Collect patient information → Generate single EMR → BDS]
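A sketch of how an EMR could be kept as a single row in the BDS, keyed by patient ID with column-qualified entries covering enquiry through discharge. The store and its method names are illustrative assumptions, not a real BDS API:

using System.Collections.Generic;

class BigDataStore
{
    // rowKey -> (columnName -> value); values are uninterpreted strings.
    private readonly Dictionary<string, Dictionary<string, string>> rows =
        new Dictionary<string, Dictionary<string, string>>();

    public void Put(string rowKey, string column, string value)
    {
        Dictionary<string, string> row;
        if (!rows.TryGetValue(rowKey, out row))
        {
            row = new Dictionary<string, string>();
            rows[rowKey] = row;
        }
        row[column] = value;
    }

    public IDictionary<string, string> GetRow(string rowKey)
    {
        Dictionary<string, string> row;
        return rows.TryGetValue(rowKey, out row)
            ? row
            : new Dictionary<string, string>();
    }
}

class EmrGenerator
{
    // Keeping every record for a patient under one row key means the whole
    // EMR can be read or updated as a single-row operation.
    public static void GenerateEmr(BigDataStore bds, string patientId)
    {
        bds.Put(patientId, "emr:enquiry", "initial enquiry notes");
        bds.Put(patientId, "emr:treatment", "t1");
        bds.Put(patientId, "emr:discharge", "discharge summary");
    }
}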

Login
The user must provide the exact username and password supplied at the
time of registration. If the login succeeds, the user is taken to the main
page; otherwise, the user remains on the login page.

[Flow diagram: Login → Check status (DB) → Proceed to next stage hierarchy]

Diagnose Schedule
After a successful login, the user views the schedule for initial treatment
provided by the admin. It contains records and scheduling information.

[Flow diagram: User → Login → schedules for Patient 1, Patient 2, Patient 3 → BDS]

View EMR detail


In this phase the authenticated user views the whole report for the
treatment. The user can see all the information from the initial enquiry
up to the final test.

[Flow diagram: User → View diagnose schedule → Check preference → View EMR records → BDS]

View MIS Details


In this module the user views the billing information and other
maintenance and financial information, such as payment mode or
insurance type.

[Flow diagram: User → View diagnose schedule → View EMR records → View MIS details → BDS]

GIVEN INPUT AND EXPECTED OUTPUT:


1. ADMIN
Authentication
Input: Provide username and password to gain access.
Output: The admin becomes an authenticated user who can make and process requests.
View Patient Records
Input: The patient information stored in the BDS, accessible to the Admin.
Output: Shows the patient's illness and other information.
Preference Setting
Input: Admin sets a priority for the user based on the disease.
Output: Sets the preference and diagnose schedule for the user.
Generate EMR
Input: Admin maintains the patient records in the BDS.
Output: Generates an EMR that includes all the reports of a particular patient.
Generate MIS Report
Input: Admin maintains the patient records in the BDS.
Output: Generates an MIS report that includes billing and medicine details.

2. USER
Authentication
Input: Provide username and password to gain access.
Output: The user becomes an authenticated user who can make and process requests.
Diagnose Schedule
Input: Admin provides the schedule for the consultation.
Output: Shows the consultation time and date.
View EMR Detail
Input: The admin-generated EMR report is sent to the patient.
Output: Shows the patient's treatment report and scanning report details.
View MIS Report
Input: The retrieved data is sent to the corresponding patient.
Output: Shows the MIS details.

TECHNIQUE USED OR ALGORITHM USED:


Big Data System
Bigtable supports several other features that allow the user to
manipulate data in more complex ways. First, Bigtable supports single-row
transactions, which can be used to perform atomic read-modify-write
sequences on data stored under a single row key. Bigtable does not
currently support general transactions across row keys, although it
provides an interface for batching writes across row keys at the clients.
Second, Bigtable allows cells to be used as integer counters. Finally,
Bigtable supports the execution of client-supplied scripts in the address
spaces of the servers. The scripts are written in a language developed at
Google for processing data called Sawzall. At the moment, our Sawzall-based
API does not allow client scripts to write back into Bigtable, but it does allow
various forms of data transformation, filtering based on arbitrary
expressions, and summarization via a variety of operators.
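As a sketch of the single-row atomic read-modify-write idea, the following C# shows a cell used as an integer counter, incremented under a lock. This is illustrative only, assuming an in-memory store; it is not Bigtable's actual API:

using System.Collections.Generic;

class SingleRowStore
{
    private readonly Dictionary<string, Dictionary<string, long>> rows =
        new Dictionary<string, Dictionary<string, long>>();
    private readonly object storeLock = new object();

    // Atomically increments a cell used as an integer counter and returns
    // the new value. A real system would lock per row, not per store.
    public long Increment(string rowKey, string column, long delta)
    {
        lock (storeLock)
        {
            Dictionary<string, long> row;
            if (!rows.TryGetValue(rowKey, out row))
            {
                row = new Dictionary<string, long>();
                rows[rowKey] = row;
            }
            long current;
            row.TryGetValue(column, out current); // 0 when the cell is absent
            row[column] = current + delta;
            return row[column];
        }
    }
}

Because the read, add, and write happen under one lock, two concurrent increments can never lose an update; this is the guarantee single-row transactions give within one row key.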

HARDWARE AND SOFTWARE REQUIREMENTS:

SOFTWARE REQUIREMENTS
Operating system : Windows 7
IDE              : Microsoft Visual Studio .NET 2010
Front end        : ASP.NET
Coding language  : C#
Back end         : SQL Server 2008

HARDWARE REQUIREMENTS
Processor : Pentium Dual Core 2.00 GHz
Hard disk : 40 GB
Mouse     : Logitech
RAM       : 2 GB (minimum)
Keyboard  : 110 keys enhanced

SYSTEM DESIGN:
USE CASE DIAGRAM:
A use case diagram is a type of behavioral diagram created from a
use-case analysis. The purpose of a use case diagram is to present an
overview of the functionality provided by the system in terms of actors,
their goals, and any dependencies between those use cases.

[Use case diagram. USER: Login, Enter user treatment information, View EMR, View MIS details. ADMIN: Login, View patient information, Generate EMR, Generate MIS.]

CLASS DIAGRAM:
A class diagram in the UML is a type of static structure diagram that
describes the structure of a system by showing the system's classes, their
attributes, and the relationships between the classes.
Private visibility hides information from anything outside the class
partition. Public visibility allows all other classes to view the marked
information. Protected visibility allows child classes to access information
they inherited from a parent class.
[Class diagram.
Patient: name, address, profession, mail, mob; view(), registration().
Administrator: name, address, mob no, mail id; EMR(), MIS(), Schedule().
Big Data System: name, size; update(), generate().]
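The class diagram above can be sketched directly as C# types; the attribute and method names follow the diagram, and the method bodies are illustrative stubs:

class Patient
{
    public string Name;
    public string Address;
    public string Profession;
    public string Mail;
    public string Mob;

    public void View() { /* view schedule, EMR, and MIS details */ }
    public void Registration() { /* register patient details */ }
}

class Administrator
{
    public string Name;
    public string Address;
    public string MobNo;
    public string MailId;

    public void Emr() { /* generate the patient's EMR */ }
    public void Mis() { /* generate the MIS report */ }
    public void Schedule() { /* set the diagnose schedule */ }
}

class BigDataSystem
{
    public string Name;
    public long Size;

    public void Update() { /* update stored records */ }
    public void Generate() { /* generate reports from stored data */ }
}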

OBJECT DIAGRAM:
An object diagram in the Unified Modeling Language (UML) is
a diagram that shows a complete or partial view of the structure of a
modeled system at a specific time.
An Object diagram focuses on some particular set of object instances
and attributes, and the links between the instances. A correlated set of
object diagrams provides insight into how an arbitrary view of a system is
expected to evolve over time.
Object diagrams are more concrete than class diagrams, and are often
used to provide examples, or act as test cases for the class diagrams. Only
those aspects of a model that are of current interest need be shown on an
object diagram.

[Object diagram.
Login: Username = Admin/User, Password = *****.
Register Details: User Id = xxxx, Name = yyyy.
View Information: User Id = xxxx, Name = yyyy, Treatment = t1, t2...
Generate EMR: User Id = xxx, Treatment = t1, t2, Report = L1, L2...
View EMR Details: User Id = xxx, Treatment = t1, t2, Report = L1, L2...
Generate MIS Record: Payment opt = O1 or O2, Test report = r1, r2.
View MIS Details: Report list = l1, l2, l3..., Checkup date = D1, D2...]

STATE DIAGRAM:
A state diagram is a type of diagram used in computer science and
related fields to describe the behavior of systems. State diagrams require
that the system described is composed of a finite number of states;
sometimes, this is indeed the case, while at other times this is a
reasonable abstraction. There are many forms of state diagrams, which
differ slightly and have different semantics.
[State diagram. User path: Registration → User Login → View Schedule details → View EMR Record → View MIS detail → Logout. Admin path: Admin Login → View Patient details → Generate EMR Report → Generate MIS details → Logout.]

ACTIVITY DIAGRAM
Activity diagrams are loosely defined diagrams that show workflows of
stepwise activities and actions, with support for choice, iteration, and
concurrency. In UML, activity diagrams can be used to describe the business
and operational step-by-step workflows of components in a system. UML
activity diagrams could potentially model the internal logic of a complex
operation. In many ways, UML activity diagrams are the object-oriented
equivalent of flow charts and data flow diagrams (DFDs) from structured
development.

[Activity diagram: Register → Login → Check valid user (if invalid, return to Register). User branch: View Diagnose Schedule → View EMR Record → View MIS details → Logout. Admin branch: Set preference → Generate EMR record → Generate MIS detail → Logout.]

SEQUENCE DIAGRAM:
A sequence diagram in UML is a kind of interaction diagram that
shows how processes operate with one another and in what order.
It is a construct of a message sequence chart. Sequence diagrams are
sometimes called event-trace diagrams, event scenarios, or timing
diagrams.
The diagram below shows the sequence in which the process flow
occurs in this project.

[Sequence diagram (participants: User, Admin, Big Data System): Login User → Register Details → View Diagnose Schedule → Admin login → Setting Priority → Generate EMR → Generate MIS report → Admin Logout → View EMR detail → View MIS report → User Logout.]

COLLABORATION DIAGRAM:
A collaboration diagram shows the objects and relationships involved
in an interaction, and the sequence of messages exchanged among the
objects during the interaction.
The collaboration diagram can be a decomposition of a class, class
diagram, or part of a class diagram. It can be the decomposition of a use
case, use case diagram, or part of a use case diagram.
The collaboration diagram shows messages being sent between
classes and objects (instances). A diagram is created for each system
operation that relates to the current development cycle (iteration).

[Collaboration diagram. User → Big Data System: 1: Login User, 2: Register Details, 3: View Diagnose Schedule, 9: View EMR detail, 10: View MIS report, 11: User Logout. Admin → Big Data System: 4: Admin login, 5: Setting Priority, 6: Generate EMR, 7: Generate MIS report, 8: Admin Logout.]

COMPONENT DIAGRAM:
Components are wired together by using an assembly connector to
connect the required interface of one component with the provided
interface of another component. This illustrates the service consumer /
service provider relationship between the two components.
An assembly connector is a connector between two components that
defines that one component provides the services that another component
requires. An assembly connector is defined from a required interface or
port to a provided interface or port.
When using a component diagram to show the internal structure of a
component, the provided and required interfaces of the encompassing
component can delegate to the corresponding interfaces of the contained
components.

[Component diagram: Login, Set Preference Setting, Generate EMR, Generate MIS Details, View Diagnose Schedule, View EMR Record, View MIS Detail, Logout.]

DATA FLOW DIAGRAM


A data flow diagram (DFD) is a graphical representation of the flow
of data through an information system. It differs from a flowchart in that
it shows the data flow instead of the control flow of the program. A data
flow diagram can also be used for the visualization of data processing. The
DFD is designed to show how a system is divided into smaller portions
and to highlight the flow of data between those parts.
The DFD is an important technique for modeling a system's high-level
detail by showing how input data is transformed into output results
through a sequence of functional transformations. DFDs reveal
relationships among and between the various components in a program or
system. DFDs consist of four major components: entities, processes, data
stores, and data flows.

LEVEL 0: [DFD level 0 diagram]

LEVEL 1: [DFD level 1 diagram]

LEVEL 2: [DFD level 2 diagram]

ALL LEVELS: [combined DFD diagram]

E-R DIAGRAM:
In software engineering, an Entity-Relationship Model (ERM) is
an abstract and conceptual representation of data. Entity-relationship
modeling is a database modeling method, used to produce a type
of conceptual schema or semantic data model of a system, often
a relational database, and its requirements in a top-down fashion.
Diagrams created by this process are called Entity-Relationship
Diagrams, ER diagrams, or ERDs.

[E-R diagram: Admin and User connect to Login (attributes: Username, Password). Admin: View Patient information, Preference Setting, Generate EMR, Generate MIS Report. User: Register Patient details, View Diagnose Schedule, View EMR Record, View MIS details.]

SYSTEM ARCHITECTURE:
The architecture diagram shows the relationships between the different
components of the system. This diagram is very important for
understanding the overall concept of the system. An architecture diagram
is a diagram of a system in which the principal parts or functions are
represented by blocks connected by lines that show the relationships of
the blocks. They are heavily used in the engineering world in hardware
design, electronic design, software design, and process flow diagrams.

[Architecture diagram: User and Admin connect through the Global Network to the Big Data System.]

FUTURE ENHANCEMENT:
Description:
In the future, we plan to conduct an empirical evaluation to assess
how replicas can be stored in the Big Data system.

Module Diagram:

[Module diagram: User → View files → Request a file → Verify the key and count → Retrieve a file.]

GIVEN INPUT AND EXPECTED OUTPUT


Input: The user provides periodic information about their health condition.
Output: The admin consults the patient and provides some tips.

ADVANTAGES
High availability
Write heavy workloads
Variable request loads
APPLICATIONS
Online Banking System
Financial Management system
Health Care System

CONCLUSION
We have described Bigtable, a distributed system for storing
structured data at Google. Our users like the performance and high
availability provided by the Bigtable implementation, and the fact that
they can scale the capacity of their clusters simply by adding more
machines to the system as their resource demands change over time.


THANK YOU
