Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level
APIs in Java, Scala, Python and R, and an optimized engine that supports general execution
graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph processing, and
Spark Streaming.
Running Spark:
First create a text file called sundar.txt, then start the Spark shell and run a word count:
scala> val ssfile = sc.textFile("sundar.txt")
scala> val counts = ssfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("output")
Exit the Scala shell and return to the home directory:
[hadoop@localhost ~]$ cd output/
[hadoop@localhost output]$ ls -1
part-00000
part-00001
_SUCCESS
The contents of the part files can then be viewed with cat, for example cat part-00000.
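The flatMap, map and reduceByKey steps above can be mirrored in plain Python to show the logic; this is a rough sketch, not PySpark, and the sample lines are made up:

```python
from collections import defaultdict

lines = ["hello spark hello", "spark counts words"]

# flatMap: split every line into words and flatten into one list
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with a count and sum the counts per word
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(sorted(counts.items()))
# [('counts', 1), ('hello', 2), ('spark', 2), ('words', 1)]
```

Spark runs the same steps in parallel across partitions; the per-word summing here is exactly what reduceByKey(_ + _) does per key.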
Hive
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL. Structure can be projected onto data
already in storage. A command line tool and JDBC driver are provided to connect users to
Hive.
Hive commands and output:
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 1.625 seconds, Fetched: 1 row(s)
hive> create database data;
OK
Time taken: 3.148 seconds
hive> use data;
OK
Time taken: 0.103 seconds
hive> create table book(id int);
OK
Time taken: 0.921 seconds
hive> describe book;
OK
id int
Time taken: 0.442 seconds, Fetched: 1 row(s)
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/input.txt' into table book;
Copying data from file:/home/cloudera/Desktop/input.txt
Copying file: file:/home/cloudera/Desktop/input.txt
Loading data to table data.book
Table data.book stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 9.821 seconds
hive> select * from book;
OK
12
123
12234
1223
122344
NULL
Time taken: 0.677 seconds, Fetched: 6 row(s)
hive> select count(*) from book;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1467705493800_0001, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1467705493800_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1467705493800_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-05 01:21:44,225 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:22:44,804 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:23:06,020 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.67 sec
2016-07-05 01:23:31,354 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.9 sec
MapReduce Total cumulative CPU time: 5 seconds 900 msec
Ended Job = job_1467705493800_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.9 sec HDFS Read: 254 HDFS
Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 900 msec
OK
6
Time taken: 231.32 seconds, Fetched: 1 row(s)
hive> select sum(id) from book;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1467705493800_0002, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1467705493800_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1467705493800_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-05 01:27:11,959 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:27:38,874 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.58 sec
2016-07-05 01:28:00,856 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.77 sec
MapReduce Total cumulative CPU time: 4 seconds 770 msec
Ended Job = job_1467705493800_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.77 sec HDFS Read: 254 HDFS
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 770 msec
OK
135936
Time taken: 70.603 seconds, Fetched: 1 row(s)
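The count(*) and sum(id) results above are standard SQL behaviour: count(*) counts every row including the NULL one, while sum(id) skips NULLs. A minimal sqlite3 sketch with the same six rows the Hive transcript shows reproduces both numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE book (id INTEGER)")
# Same rows the Hive transcript shows, including the NULL row
rows = [(12,), (123,), (12234,), (1223,), (122344,), (None,)]
cur.executemany("INSERT INTO book VALUES (?)", rows)

print(cur.execute("SELECT COUNT(*) FROM book").fetchone()[0])  # 6
print(cur.execute("SELECT SUM(id) FROM book").fetchone()[0])   # 135936
```

The difference is that Hive compiles each query into a MapReduce job, which is why the same one-row answer takes minutes instead of milliseconds on such a tiny table.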
hive> create table book1(ds string,id int);
OK
Time taken: 1.021 seconds
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/chennai1.txt' into table book1;
Copying data from file:/home/cloudera/Desktop/chennai1.txt
Copying file: file:/home/cloudera/Desktop/chennai1.txt
Loading data to table data.book1
Table data.book1 stats: [numFiles=1, numRows=0, totalSize=69, rawDataSize=0]
OK
Time taken: 0.931 seconds
hive> select * from book1;
OK
INDIA 12/12/12 NULL
INDIA 22/12/12 NULL
USA 12/12/16 NULL
USA 12/12/18 NULL
usa 12/12/18 NULL
Time taken: 0.138 seconds, Fetched: 5 row(s)
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences
of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject). Pig's language layer currently consists of a textual language called
Pig Latin.
Virtualization
storage virtualization: the amalgamation of multiple network storage devices into what
appears to be a single storage unit.
server virtualization: the partitioning of a physical server into smaller virtual servers.
operating system-level virtualization: a type of server virtualization technology which works
at the operating system (kernel) layer.
network virtualization: using network resources through a logical segmentation of a single
physical network.
application virtualization: running an application in an encapsulated environment, decoupled
from the underlying operating system.
Hypervisor:
A hypervisor, also called a virtual machine manager, is a program that allows multiple
operating systems to share a single hardware host. Each operating system appears to have the
host's processor, memory, and other resources all to itself.
Deploying server
To deploy a server, the EC2 service of AWS is used to create an instance of type t2.micro
with full virtualization of Ubuntu 14.04 LTS installed on it. A VPC is chosen and subnetting
is done accordingly. Elastic Block Storage can be attached according to the storage space
required. A security group is set up to open only the selected ports and allow only
particular protocols for communication. A key pair is generated to give the developer
access to the server.
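The launch described above can be sketched as boto3 parameters; the AMI ID, key pair, security group and subnet IDs below are illustrative placeholders, not values from this document:

```python
# Parameters for the EC2 launch described above (all IDs are placeholders).
launch_params = {
    "ImageId": "ami-0123456789abcdef0",  # an Ubuntu 14.04 LTS AMI (placeholder)
    "InstanceType": "t2.micro",
    "MinCount": 1,
    "MaxCount": 1,
    "KeyName": "my-key-pair",            # key pair for developer SSH access
    "SecurityGroupIds": ["sg-0abc123"],  # security group restricting ports/protocols
    "SubnetId": "subnet-0abc123",        # subnet inside the chosen VPC
}

# With AWS credentials configured, the instance would be launched with:
# import boto3
# boto3.client("ec2").run_instances(**launch_params)
print(launch_params["InstanceType"])
```

The EBS volume and its size can likewise be specified at launch via a block device mapping, or attached later.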
Creating an AMI
The main reasons to create an AMI are to deploy multiple similar servers and to support
disaster recovery. If a server is terminated due to an error, stops functioning, or its
Availability Zone loses power or becomes dysfunctional, the AMI lets us redeploy the same
server to any Availability Zone in the Region.
Auto Scaling
Auto Scaling is a feature provided by Amazon that automatically increases or decreases the
number of servers according to the criteria set in an alarm.
Auto Scaling is triggered by an alarm in CloudWatch. The scaling takes place automatically
as per the rules set by the developer.
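The alarm-driven scaling described above can be sketched as boto3-style parameter sets; the group name, policy name, threshold and the policy ARN mentioned in the comments are illustrative assumptions:

```python
# Scaling policy: add one server when the alarm fires (names are placeholders).
scaling_policy = {
    "AutoScalingGroupName": "web-asg",
    "PolicyName": "scale-out-on-cpu",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 1,
}

# CloudWatch alarm: average CPU above 70% for two 5-minute periods.
cpu_alarm = {
    "AlarmName": "high-cpu",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 2,
    "Period": 300,
}

# With credentials configured, these would be applied with:
# import boto3
# boto3.client("autoscaling").put_scaling_policy(**scaling_policy)
# boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm, AlarmActions=[policy_arn])
print(scaling_policy["ScalingAdjustment"], cpu_alarm["Threshold"])
```

A second policy and alarm pair, with ScalingAdjustment set to -1 and a low-CPU threshold, would handle scaling back in.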
RDS
RDS is the relational database service provided by AWS for handling databases with various
engines. One of the most widely used engines is MySQL.
After deploying an RDS instance, a MySQL client needs to be installed on the server that
will use the RDS instance's services. All standard SQL queries can then be run against it.