Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level
APIs in Java, Scala, Python and R, and an optimized engine that supports general execution
graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and
structured data processing, MLlib for machine learning, GraphX for graph processing, and
Spark Streaming.
Running Spark:
First create a text file called sundar.txt, then start the Spark shell and run a word count:
scala> val ssfile = sc.textFile("sundar.txt")
scala> val counts = ssfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("output")
Exit the Scala shell and return to the home directory:
[hadoop@localhost ~]$ cd output/
[hadoop@localhost output]$ ls -1
part-00000
part-00001
_SUCCESS
The contents of the part files can then be viewed with cat, for example cat part-00000.
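The flatMap, map and reduceByKey steps above can be mirrored in plain Python to show the logic; this is a rough sketch, not PySpark, and the sample lines are made up:

```python
from collections import defaultdict

lines = ["hello spark hello", "spark counts words"]

# flatMap: split every line into words and flatten into one list
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with a count and sum the counts per word
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(sorted(counts.items()))
# [('counts', 1), ('hello', 2), ('spark', 2), ('words', 1)]
```

Spark runs the same steps in parallel across partitions; the per-word summing here is exactly what reduceByKey(_ + _) does per key.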
Hive
The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large
datasets residing in distributed storage using SQL. Structure can be projected onto data
already in storage. A command line tool and JDBC driver are provided to connect users to
Hive.
Hive commands and output:
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 1.625 seconds, Fetched: 1 row(s)
hive> create database data;
OK
Time taken: 3.148 seconds
hive> use data;
OK
Time taken: 0.103 seconds
hive> create table book(id int);
OK
Time taken: 0.921 seconds
hive> describe book;
OK
id int
Time taken: 0.442 seconds, Fetched: 1 row(s)
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/input.txt' into table book;
Copying data from file:/home/cloudera/Desktop/input.txt
Copying file: file:/home/cloudera/Desktop/input.txt
Loading data to table data.book
Table data.book stats: [numFiles=1, numRows=0, totalSize=26, rawDataSize=0]
OK
Time taken: 9.821 seconds
hive> select * from book;
OK
12
123
12234
1223
122344
NULL
Time taken: 0.677 seconds, Fetched: 6 row(s)
hive> select count(*) from book;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1467705493800_0001, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1467705493800_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1467705493800_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-05 01:21:44,225 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:22:44,804 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:23:06,020 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.67 sec
2016-07-05 01:23:31,354 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.9 sec
MapReduce Total cumulative CPU time: 5 seconds 900 msec
Ended Job = job_1467705493800_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.9 sec HDFS Read: 254 HDFS
Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 900 msec
OK
6
Time taken: 231.32 seconds, Fetched: 1 row(s)
hive> select sum(id) from book;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1467705493800_0002, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1467705493800_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1467705493800_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-07-05 01:27:11,959 Stage-1 map = 0%, reduce = 0%
2016-07-05 01:27:38,874 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.58 sec
2016-07-05 01:28:00,856 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.77 sec
MapReduce Total cumulative CPU time: 4 seconds 770 msec
Ended Job = job_1467705493800_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.77 sec HDFS Read: 254 HDFS
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 770 msec
OK
135936
Time taken: 70.603 seconds, Fetched: 1 row(s)
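The count(*) and sum(id) results above are standard SQL behaviour: count(*) counts every row including the NULL one, while sum(id) skips NULLs. A minimal sqlite3 sketch with the same six rows the Hive transcript shows reproduces both numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE book (id INTEGER)")
# Same rows the Hive transcript shows, including the NULL row
rows = [(12,), (123,), (12234,), (1223,), (122344,), (None,)]
cur.executemany("INSERT INTO book VALUES (?)", rows)

print(cur.execute("SELECT COUNT(*) FROM book").fetchone()[0])  # 6
print(cur.execute("SELECT SUM(id) FROM book").fetchone()[0])   # 135936
```

The difference is that Hive compiles each query into a MapReduce job, which is why the same one-row answer takes minutes instead of milliseconds on such a tiny table.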
hive> create table book1(ds string,id int);
OK
Time taken: 1.021 seconds
hive> LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/chennai1.txt' into table book1;
Copying data from file:/home/cloudera/Desktop/chennai1.txt
Copying file: file:/home/cloudera/Desktop/chennai1.txt
Loading data to table data.book1
Table data.book1 stats: [numFiles=1, numRows=0, totalSize=69, rawDataSize=0]
OK
Time taken: 0.931 seconds
hive> select * from book1;
OK
INDIA 12/12/12 NULL
INDIA 22/12/12 NULL
USA 12/12/16 NULL
USA 12/12/18 NULL
usa 12/12/18 NULL
Time taken: 0.138 seconds, Fetched: 5 row(s)
Pig
Apache Pig is a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences
of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject). Pig's language layer currently consists of a textual language called
Pig Latin.
Virtualization
storage virtualization: the amalgamation of multiple network storage devices into what
appears to be a single storage unit.
server virtualization: the partitioning of a physical server into smaller virtual servers.
operating system-level virtualization: a type of server virtualization technology which works
at the operating system (kernel) layer.
network virtualization: using network resources through a logical segmentation of a single
physical network.
application virtualization: running an application in an encapsulated environment, decoupled
from the underlying operating system.
Hypervisor:
A hypervisor, also called a virtual machine manager, is a program that allows multiple
operating systems to share a single hardware host. Each operating system appears to have the
host's processor, memory, and other resources all to itself.
Deploying server
To deploy a server, the EC2 service of AWS is used to create an instance of type t2.micro
with full virtualization of Ubuntu 14.04 LTS installed on it. A VPC is chosen and subnetting
is done accordingly. Elastic Block Storage can be attached according to the storage space
required. A security group is set up to open only the selected ports and allow only
particular protocols for communication. A key pair is generated to give the developer
access to the server.
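The launch described above can be sketched as boto3 parameters; the AMI ID, key pair, security group and subnet IDs below are illustrative placeholders, not values from this document:

```python
# Parameters for the EC2 launch described above (all IDs are placeholders).
launch_params = {
    "ImageId": "ami-0123456789abcdef0",  # an Ubuntu 14.04 LTS AMI (placeholder)
    "InstanceType": "t2.micro",
    "MinCount": 1,
    "MaxCount": 1,
    "KeyName": "my-key-pair",            # key pair for developer SSH access
    "SecurityGroupIds": ["sg-0abc123"],  # security group restricting ports/protocols
    "SubnetId": "subnet-0abc123",        # subnet inside the chosen VPC
}

# With AWS credentials configured, the instance would be launched with:
# import boto3
# boto3.client("ec2").run_instances(**launch_params)
print(launch_params["InstanceType"])
```

The EBS volume and its size can likewise be specified at launch via a block device mapping, or attached later.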
Creating an AMI
The main reasons to create an AMI are to deploy multiple similar servers and to support
disaster recovery. If a server is terminated due to an error, stops functioning, or its
Availability Zone loses power or becomes dysfunctional, the AMI lets us redeploy the same
server to any Availability Zone in the Region.
Auto Scaling
Auto Scaling is a feature provided by Amazon that automatically increases or decreases the
number of servers according to the criteria set in an alarm.
Auto Scaling is triggered by an alarm in CloudWatch. The scaling takes place automatically
as per the rules set by the developer.
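The alarm-driven scaling described above can be sketched as boto3-style parameter sets; the group name, policy name, threshold and the policy ARN mentioned in the comments are illustrative assumptions:

```python
# Scaling policy: add one server when the alarm fires (names are placeholders).
scaling_policy = {
    "AutoScalingGroupName": "web-asg",
    "PolicyName": "scale-out-on-cpu",
    "AdjustmentType": "ChangeInCapacity",
    "ScalingAdjustment": 1,
}

# CloudWatch alarm: average CPU above 70% for two 5-minute periods.
cpu_alarm = {
    "AlarmName": "high-cpu",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 2,
    "Period": 300,
}

# With credentials configured, these would be applied with:
# import boto3
# boto3.client("autoscaling").put_scaling_policy(**scaling_policy)
# boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm, AlarmActions=[policy_arn])
print(scaling_policy["ScalingAdjustment"], cpu_alarm["Threshold"])
```

A second policy and alarm pair, with ScalingAdjustment set to -1 and a low-CPU threshold, would handle scaling back in.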
RDS
RDS is the relational database service provided by AWS for handling databases with various
engines. One of the most widely used engines is MySQL.
After deploying an RDS instance, a MySQL client needs to be installed on the server that
will use the RDS instance's services. All standard SQL queries can then be run against it.