CC Unit 04
A cloud provides the vast amounts of storage and computing cycles demanded by many applications. The
network-centric content model allows a user to access data stored on a cloud from any device connected to the Internet.
Mobile devices with limited power reserves and local storage take advantage of cloud environments to store audio and
video files. Clouds provide an ideal environment for multimedia content delivery.
A variety of sources feed a continuous stream of data to cloud applications. An ever-increasing number of
cloud-based services collect detailed data about their services and information about the users of these services. Then the
service providers use the clouds to analyze that data.
A new concept, “big data,” reflects the fact that many applications use data sets so large that they cannot be
stored and processed using local resources. The consensus is that “big data” growth can be viewed as a three-dimensional
phenomenon; it implies an increased volume of data, requires an increased processing speed to process more data and
produce more results, and at the same time it involves a diversity of data sources and data types.
Cloud storage is a service in which data is remotely maintained, managed, and backed up. The service allows users to
store files online so that they can access them from any location via the Internet. According to a survey of more than
800 business decision makers and users worldwide, the number of organizations gaining competitive advantage through
high cloud adoption has almost doubled in the last few years, and the public cloud services market was predicted to
exceed $244 billion by 2017. Now, let's look at some of the advantages and disadvantages of cloud storage.
Advantages of Cloud Storage
1. Usability: All the cloud storage services reviewed in this topic provide desktop folders for Macs and PCs. This allows
users to drag and drop files between cloud storage and their local storage.
2. Bandwidth: You can avoid emailing files to individuals and instead send a web link to recipients through your email.
3. Accessibility: Stored files can be accessed from anywhere via Internet connection.
4. Disaster Recovery: It is highly recommended that businesses have a backup plan ready in case of an emergency.
Cloud storage can serve as such a backup by keeping a second copy of important files at a remote location,
accessible through an Internet connection.
5. Cost Savings: Businesses and organizations can often reduce annual operating costs by using cloud storage, which costs
only about 3 cents per gigabyte. Users see additional cost savings because storing information remotely requires no
internal power.
Disadvantages of Cloud Storage
1. Usability: Be careful when using drag and drop to move a document into the cloud storage folder: doing so permanently
moves the document from its original folder to the cloud storage location. Copy and paste instead of dragging if you want
to retain the document in its original location in addition to placing a copy in the cloud storage folder.
2. Bandwidth: Several cloud storage services have a specific bandwidth allowance. If an organization surpasses the
given allowance, the additional charges could be significant. However, some providers allow unlimited bandwidth. This
is a factor that companies should consider when looking at a cloud storage provider.
3. Accessibility: If you have no internet connection, you have no access to your data.
4. Data Security: There are concerns about the safety and privacy of important data stored remotely. The possibility of
private data commingling with that of other organizations makes some businesses uneasy.
5. Software: If you want to be able to manipulate your files locally across multiple devices, you will need to install the
service's client on each of them.
The technological capacity to store information has grown over time at an accelerated pace:
1986: 2.6 EB; equivalent to less than one 730 MB CD-ROM of data per computer user.
1993: 15.8 EB; equivalent to four CD-ROMs per user.
2000: 54.5 EB; equivalent to 12 CD-ROMs per user.
2007: 295 EB; equivalent to almost 61 CD-ROMs per user.
6.3 Storage models, file systems, and databases
A storage model describes the layout of a data structure in physical storage; a data model captures
the most important logical aspects of a data structure in a database. The physical storage can be a
local disk, removable media, or storage accessible via a network.
The General Parallel File System (GPFS) was developed at IBM in the early 2000s as a successor to the
TigerShark multimedia file system. GPFS is a parallel file system that emulates closely the behavior of a general-purpose
POSIX system running on a single system. GPFS was designed for optimal performance of large clusters; it can support a
file system of up to 4 PB consisting of up to 4,096 disks of 1 TB each (see Figure).
The maximum file size is 2^63 − 1 bytes. A file consists of blocks of equal size, ranging from 16 KB to 1
MB striped across several disks. The system could support not only very large files but also a very large number of files.
The directories use extensible hashing techniques to access a file. The system maintains user data, file metadata such as
the time when last modified, and file system metadata such as allocation maps. Metadata, such as file attributes and data
block addresses, is stored in inodes and indirect blocks.
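To make the striping concrete, the following is an illustrative sketch (not GPFS code) of how round-robin striping maps a byte offset in a file to a block index and a disk. The block size and disk count are hypothetical values chosen within the ranges the text mentions.

```python
BLOCK_SIZE = 256 * 1024      # 256 KB, within GPFS's 16 KB - 1 MB range
NUM_DISKS = 8                # small example; GPFS supports thousands of disks

def locate(offset):
    """Return (block_index, disk_index, offset_within_block) for a byte offset."""
    block_index = offset // BLOCK_SIZE
    disk_index = block_index % NUM_DISKS   # round-robin striping across disks
    return block_index, disk_index, offset % BLOCK_SIZE

# The first 8 blocks land on disks 0..7; block 8 wraps back to disk 0.
print(locate(0))                   # (0, 0, 0)
print(locate(BLOCK_SIZE * 8 + 5))  # (8, 0, 5)
```

Because consecutive blocks land on different disks, large sequential reads can be served by many disks in parallel, which is the point of striping in a parallel file system.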
File systems were originally developed for centralized computer systems and desktop computers as an
operating system facility providing a convenient programming interface to disk storage.
They subsequently acquired features such as access-control and file-locking mechanisms that made them
useful for the sharing of data and programs.
Distributed file systems support the sharing of information in the form of files and hardware resources in the
form of persistent storage throughout an intranet. A well-designed file service provides access to files stored at a server
with performance and reliability similar to, and in some cases better than, files stored on local disks, allowing users
to access their files from any computer in an intranet. The concentration of persistent storage at a few servers reduces
the need for local disk storage and (more importantly) enables economies to be made in the management and archiving of
the persistent data owned by an organization. Other services, such as the name service, the user authentication service,
and the print service, can be more easily implemented when they can call upon the file service to meet their needs for
persistent storage.
GFS was built primarily as the fundamental storage service for Google’s search engine. As the size of the
web data that was crawled and saved was quite substantial, Google needed a distributed file system to
redundantly store massive amounts of data on cheap and unreliable computers. None of the traditional distributed
file systems could provide such functionality or hold such large amounts of data. In addition, GFS was designed for
Google applications, and Google applications were built for GFS. In traditional file system design, such a philosophy
is not attractive, as there should be a clear interface between applications and the file system, such as a POSIX
interface.
Thus, Google made some special decisions regarding the design of GFS. As noted earlier, a 64 MB block
size was chosen. Reliability is achieved through replication: each chunk or data block of a file is replicated
across three or more chunk servers. A single master coordinates access and keeps the metadata.
The architecture of GFS will look very familiar if you know HDFS. In GFS, there is a single master server
(similar to the HDFS NameNode) and a chunkserver on each storage node (similar to an HDFS DataNode). Files are broken
into large, fixed-size chunks of 64 MB (similar to HDFS blocks), which are stored as local Linux files and are
replicated for high availability (three replicas by default). The master maintains all the metadata of the files and
chunks in memory.
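The master's role can be sketched as an in-memory lookup from (filename, byte offset) to a chunk handle and its replica locations, which is what a GFS client asks the master for before reading data directly from a chunkserver. This is a toy illustration, not Google's code; all names and structures here are assumptions.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

class ToyMaster:
    """A toy GFS-style master holding per-file chunk metadata in memory."""

    def __init__(self):
        # filename -> list of (chunk_handle, replica_locations), in chunk order
        self.files = {}

    def add_chunk(self, filename, handle, replicas):
        self.files.setdefault(filename, []).append((handle, replicas))

    def lookup(self, filename, offset):
        """Translate (filename, byte offset) into a chunk handle plus the
        chunkservers holding its replicas."""
        chunk_index = offset // CHUNK_SIZE
        return self.files[filename][chunk_index]

master = ToyMaster()
master.add_chunk("/logs/crawl-0", "handle-0", ["cs1", "cs2", "cs3"])
master.add_chunk("/logs/crawl-0", "handle-1", ["cs2", "cs4", "cs5"])

handle, replicas = master.lookup("/logs/crawl-0", CHUNK_SIZE + 1)
print(handle, replicas)  # handle-1 ['cs2', 'cs4', 'cs5']
```

After this metadata exchange, the client contacts one of the replicas directly, so the single master stays off the data path and does not become a bandwidth bottleneck.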
• Job Submission Each job is submitted from a user node to the JobTracker, which may reside on a
different node within the cluster, through the following procedure:
➢ A user node asks for a new job ID from the JobTracker and computes input file splits.
➢ The user node copies some resources, such as the job’s JAR file, configuration file, and computed input
splits, to the JobTracker’s file system.
➢ The user node submits the job to the JobTracker by calling the submitJob() function.
• Task assignment The JobTracker creates one map task for each computed input split by the user node
and assigns the map tasks to the execution slots of the TaskTrackers. The JobTracker considers the
localization of the data when assigning the map tasks to the TaskTrackers. The JobTracker also creates
reduce tasks and assigns them to the TaskTrackers. The number of reduce tasks is predetermined by the
user, and there is no locality consideration in assigning them.
• Task execution The control flow to execute a task (either map or reduce) starts inside the TaskTracker
by copying the job JAR file to its file system. Instructions inside the job JAR file are executed after
launching a Java Virtual Machine (JVM) to run its map or reduce task.
• Task running check A task running check is performed through periodic heartbeat messages sent to the
JobTracker by the TaskTrackers. Each heartbeat notifies the JobTracker that the sending TaskTracker is
alive and whether it is ready to run a new task.
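The map and reduce tasks that this machinery schedules can be sketched in a few lines. The following single-process word count mimics the flow: one map task per input split, a shuffle that groups intermediate pairs by key, and one reduce call per key. The function names are ours, not Hadoop's API.

```python
from collections import defaultdict

def map_task(split):
    """Map phase: emit a (word, 1) pair for every word in an input split."""
    for line in split:
        for word in line.split():
            yield word, 1

def reduce_task(word, counts):
    """Reduce phase: sum the counts collected for one key."""
    return word, sum(counts)

# Two input splits; the JobTracker would create one map task for each.
splits = [["the quick brown fox"], ["the lazy dog", "the end"]]

# Shuffle: group intermediate pairs by key, as Hadoop does between phases.
grouped = defaultdict(list)
for split in splits:
    for word, count in map_task(split):
        grouped[word].append(count)

result = dict(reduce_task(w, c) for w, c in grouped.items())
print(result["the"])  # 3
```

In a real cluster the map tasks run near their input splits (the data-locality consideration mentioned above), and the shuffle moves intermediate data over the network to the reduce tasks.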
Bigtable
Google BigTable is a nonrelational, distributed and multidimensional data storage mechanism built on the
proprietary Google storage technologies for most of the company's online and back-end applications/products. It
provides scalable data architecture for very large database infrastructures.
Google BigTable is mainly used in proprietary Google products, although some access is available in the
Google App Engine and third-party database applications.
Google BigTable is a persistent and sorted map. The map is indexed by a row key, a column key (of several
types), and a timestamp; each value in the map is an uninterpreted string of bytes. For example, the data for a
website is saved as follows:
The reversed URL address is saved as the row name (com.google.www).
The content column stores the Web page contents.
The anchor content saves any anchor text or content referencing the page.
A time stamp provides the exact time when the data was stored and is used for sorting multiple instances
of a page.
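The (row, column, timestamp) → value model above can be sketched with an ordinary dictionary. This is only an illustration of the data model, not BigTable's implementation; the column names follow the web-page example in the text.

```python
import time

table = {}  # (row key, column key, timestamp) -> value

def put(row, column, value, ts=None):
    """Store a value under (row, column) at an explicit or current timestamp."""
    table[(row, column, ts if ts is not None else time.time())] = value

def get_latest(row, column):
    """Return the value with the highest timestamp for (row, column)."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# The row key is the reversed URL, so pages of one domain sort together.
put("com.google.www", "contents:", "<html>old</html>", ts=1)
put("com.google.www", "contents:", "<html>new</html>", ts=2)
put("com.google.www", "anchor:cnnsi.com", "Google", ts=1)

print(get_latest("com.google.www", "contents:"))  # <html>new</html>
```

Keeping both timestamped versions of the page contents is exactly how BigTable sorts and retains multiple instances of a page, as described above.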
Google BigTable is built on technologies like Google File System (GFS) and SSTable. It is used by more than
60 Google applications, including Google Finance, Google Reader, Google Maps, Google Analytics and Web
indexing.
Megastore
Megastore is a storage system developed to meet the requirements of today's interactive online services.
Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way,
providing both strong consistency guarantees and high availability. It offers fully serializable ACID semantics
within fine-grained partitions of data. This partitioning allows the system to synchronously replicate each write
across a wide area network with reasonable latency and to support seamless failover between datacenters. Megastore
underpins a wide range of Google production services.
The middle ground between traditional and NoSQL databases taken by the Megastore designers is also
reflected in the data model. The data model is declared in a schema consisting of a set of tables composed of
entities, each entity being a collection of named and typed properties. The unique primary key of an entity in a table
is created as a composition of the entity's properties. A Megastore table can be a root or a child table. Each child
entity must reference a special entity, called a root entity, in its root table. An entity group consists of a root
entity and all the entities that reference it.
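One way to picture the root/child relationship is through keys: a child entity's key embeds the key of its root entity, so an entity group is simply every entity whose key starts with the root's key. The table and property names below are hypothetical, chosen only to illustrate the composition.

```python
# Keys composed from entity properties, as tuples (table, value, table, value, ...)
root_key = ("User", "alice")              # a root entity in a root table
photo1 = ("User", "alice", "Photo", 1)    # child entities reference the root
photo2 = ("User", "alice", "Photo", 2)
other = ("User", "bob", "Photo", 7)       # belongs to a different entity group

def same_entity_group(root, entity):
    """An entity is in the root's entity group iff its key extends the root key."""
    return entity[:len(root)] == root

assert same_entity_group(root_key, photo1)
assert not same_entity_group(root_key, other)
```

Because entities of one group share a key prefix, they sort next to each other, which is what lets Megastore treat a group as the fine-grained partition within which it offers serializable ACID semantics.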
The system makes extensive use of BigTable. Entities from different Megastore tables can be mapped to
the same BigTable row without collisions. This is possible because the BigTable column name is a concatenation of
the Megastore table name and the name of a property. A BigTable row for the root entity stores the transaction
log and all the metadata for the entity group. As we saw in Section 8.9, multiple versions of the data, with different
time stamps, can be stored in a cell. Megastore takes advantage of this feature to implement multi-version concurrency
control (MVCC): when a mutation of a transaction occurs, the mutation is recorded along with its time stamp, rather
than marking the old data as obsolete.
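The timestamp-based MVCC described above can be sketched as follows: every mutation is appended with its timestamp instead of overwriting the old value, and a read at time T sees the newest version written at or before T. This is a minimal illustration of the idea, not Megastore's implementation.

```python
class MVCCCell:
    """A single cell holding timestamped versions, never overwriting old data."""

    def __init__(self):
        self.versions = []            # list of (timestamp, value)

    def write(self, ts, value):
        self.versions.append((ts, value))   # old data is not marked obsolete

    def read(self, ts):
        """Return the newest value written at or before timestamp ts."""
        visible = [(t, v) for t, v in self.versions if t <= ts]
        return max(visible)[1] if visible else None

cell = MVCCCell()
cell.write(10, "v1")
cell.write(20, "v2")

print(cell.read(15))  # v1  (a snapshot at t=15 does not see the later write)
print(cell.read(25))  # v2
```

Because readers pick a version by timestamp instead of locking the cell, reads of a consistent snapshot can proceed concurrently with writes, which is the benefit MVCC buys.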