
A Project Report

On
Twitter Sentiment Analysis
Submitted in Partial Fulfillment of the requirement for the award of the degree of
BACHELOR OF TECHNOLOGY
(Computer Science and Engineering)

Submitted by
Shubham
160970101046
Under the Guidance of
Mr. Azmat Siddiqui

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

THDC INSTITUTE OF HYDROPOWER ENGINEERING & TECHNOLOGY
TEHRI, UTTARAKHAND
(Uttarakhand Technical University, Dehradun)
2016-2020

CERTIFICATE

I hereby certify that the work which is being presented in the report entitled “Twitter Sentiment
Analysis” in partial fulfillment of the requirement for the award of degree of Bachelor of Technology is
uniquely prepared by me after the completion of 40 days internship under the supervision of Mr. Azmat
Siddiqui (Data Scientist) at KVCH.

I also confirm that the report has been prepared by me and that all the information, facts and figures in this report are based on my own experience and study during the summer internship.

Date:

Signature of the Candidate Signature of Internal faculty Supervisor

Shubham
160970101046

ACKNOWLEDGEMENT

The internship opportunity I had with KVCH, Noida, in partial fulfillment of the B.Tech (CSE) program under THDC Institute of Hydropower Engineering and Technology, New Tehri, was a great chance for learning and professional development. I therefore consider myself a lucky individual to have been provided with an opportunity to be a part of it. At the outset, I would like to express my gratitude to our HOD, Mr. Ashish Joshi, the faculty members, and my training guide Mr. Azmat Siddiqui (Data Scientist) for guiding me right from the inception till the successful completion of the training, and for extending their valuable guidance and support on critical aspects of the project.

Submitted by:
Shubham
160970101046

Guided by:
Mr. Azmat Siddiqui
(Data Scientist)

Contents
Introduction to BigData
  1.1 What is BigData?
  1.2 Apache Hadoop
  1.3 Google File System
  1.4 History
Software Installation
  2.1 VMware Workstation
    2.1.1 Tools of VMware
  2.2 Ubuntu
    2.2.1 Installation of Ubuntu on VMware workstation
  2.3 Hortonworks Framework
    2.3.1 Hortonworks Sandbox
    2.3.2 Importing sandbox in VMware
Hadoop Installation
  3.1 Installation in Standalone mode
  3.2 Installation of Pseudo Distributed mode of Hadoop
Installation of Pig and Hive
  4.1 Pig
    4.1.1 Pig vs SQL
  4.2 Installation of Pig
  4.3 Hive
    4.3.1 HiveQL
  4.4 Hive installation
Sentiment Analysis in Hive
  5.1 Adding the SerDe
  5.2 Analysis Part
  5.3 Creating External Table for Tweets Storage
  5.4 Final Result

Introduction to BigData

1.1 What is BigData?


BigData is the huge volume of data produced by different devices and applications: data sets so large and complex that traditional data processing applications are inadequate to handle them.
Some fields that generate big data are:

• Black Box Data: A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock Exchange Data: The stock exchange data holds information about the 'buy' and 'sell' decisions made by customers on shares of different companies.
• Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
• Transport Data: Transport data includes the model, capacity, distance and availability of a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.

In order to handle this much data, Apache introduced the framework "Hadoop", which handles such data efficiently. BigData often refers not to a particular size of data set but to the use of predictive analytics, user behavior analytics, or other advanced analytics methods that extract value from data. Data sets are growing rapidly: the world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s, and as of 2012 about 2.5 exabytes (2.5×10^18 bytes) of data were created every day.

1.2 Apache Hadoop


Apache Hadoop is an open source software framework for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes so that they can process, in parallel, the data they hold. This approach takes advantage of data locality (nodes manipulating the data they already have access to), allowing the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the following modules:
• Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN - a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications.
• Hadoop MapReduce - an implementation of the MapReduce programming model for large-scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache Flume, Apache Sqoop, Apache Oozie and Cloudera Impala. Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and the Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though Java MapReduce code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program, as the example below illustrates. Other projects in the Hadoop ecosystem expose richer user interfaces.
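As a minimal sketch of Hadoop Streaming (the jar path, version and HDFS directories below are assumptions that depend on the local installation), a trivial streaming job can use standard Unix tools as the mapper and reducer:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
  -input /user/shubhi/streaming-in \
  -output /user/shubhi/streaming-out \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Here /bin/cat simply passes each input line through as the "map" step, and /usr/bin/wc counts lines, words and characters as the "reduce" step.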

1.3 Google File System


Google file system (GFS) is a proprietary distributed file system developed by Google for
its own use. It is designed to provide efficient, reliable access to data using clusters of
commodity hardware. A new version of the Google file system is codenamed Colossus
which was released in 2010.
GFS is enhanced for Google's core data storage and usage needs, which can generate
enormous amounts of data that needs to be retained. GFS grew out of an earlier Google
effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google,
while it was still located in Stanford. Files are divided into fixed-size chunks of 64
megabytes, similar to clusters or sectors in regular file systems, which are only extremely
rarely overwritten, or shrunk. Files are usually appended to or read. It is also designed
and optimized to run on Google's computing clusters, dense nodes which consist of cheap
"commodity" computers, which means precautions must be taken against the high failure
rate of individual nodes and the subsequent data loss. Other design decisions select for
high data throughputs, even when it comes at the cost of latency.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: a Master node and a large number of Chunkservers. Each file is divided into fixed-size chunks, which the Chunkservers store. Each chunk is assigned a unique 64-bit label by the master node at the time of creation, and logical mappings of files to their constituent chunks are maintained. Each chunk is replicated several times throughout the network, with a minimum of three replicas, and even more for files that are in high demand or need more redundancy. The master server does not usually store the actual chunks, but rather all the metadata associated with the chunks, such as the tables mapping the 64-bit labels to chunk locations and the files they make up, the locations of the copies of the chunks, which processes are reading or writing to a particular chunk, and whether a "snapshot" of a chunk is being taken in order to replicate it. All this metadata is kept current by the Master server periodically receiving updates from each chunk server.
Permissions for modifications are handled by a system of time-limited, expiring "leases",
where the Master server grants permission to a process for a finite period of time during
which no other process will be granted permission by the Master server to modify the
chunk. The modifying chunkserver, which is always the primary chunk holder, then
propagates the changes to the chunkservers with the backup copies. The changes are not
saved until all chunkservers acknowledge, thus guaranteeing the completion and
atomicity of the operation.
Programs access the chunks by first querying the Master server for the locations of the desired chunks; if the chunks are not being operated on, the master replies with the locations, and the program then contacts and receives the data directly from the chunkserver.
Unlike most other file systems, GFS is not implemented in the kernel of an operating
system, but is instead provided as a user space library.

1.4 History
Doug Cutting developed Hadoop with Mike Cafarella while the two worked on an open-source web crawler called Nutch, a project they started together in October 2002. In January 2006, Cutting started a sub-project by carving the Hadoop code out of Nutch. A few months later, in March 2006, Yahoo created its first Hadoop research cluster.
In the 10 years that followed, Hadoop has evolved into an open Source ecosystem for
handling and analyzing BigData. The first Apache release of Hadoop came in September
2007, and it soon became a top-level Apache project. Cloudera, the first company to
commercialize Hadoop, was founded in August 2008. That might seem like a speedy
timeline, but, in fact, Hadoop's evolution was neither simple nor fast. The goal for Nutch
was to download every page of the Web, store those pages, process them, and then
analyze them all to understand the links between the pages. "It was pretty clunky in
operation".
Cutting and Cafarella only had five machines to work with. Many manual steps were
needed to operate the system. There was no built-in reliability. If you lost a machine, you
lost data.
The break came from Google, when it published a paper in 2004 outlining MapReduce,
which allows users to manage large-scale data processing across a large number of
commodity servers. Soon, Cutting and Cafarella had Nutch running 20 machines. The
APIs they had crafted proved useful. "It was still unready for prime time".
Cutting joined Yahoo in January 2006, and the Company decided to invest in the
technology particularly the code that Cutting had carved out of Nutch, which was called
Hadoop, named after his son's stuffed elephant.
By 2008, Hadoop had a well-developed community of users. It became a top-level
Apache project, and Yahoo announced the launch of what was then the world's largest
Hadoop application. Cloudera was founded in August 2008 as the first company to
commercialize Hadoop.
Cutting has said, " We were not going to depend on any company or person, we got to
have technology that is useful".
Now that Hadoop has become more commonplace, two types of users have emerged. The
first are people "who find a problem they cannot solve any other way".
As an example, Cutting cited a credit company with a data warehouse that could only
store 90 days’ worth of information. Hadoop allowed the company to pool five years’
worth of data. Analysis revealed patterns of credit card fraud that could not be detected
within the shorter time limit.
The second type of user will apply Hadoop to solve a problem in a way that had not been
technically possible before, according to Cutting. Here he cited the example of a bank
that had to understand its total exposure to risk. It had a retail banking unit, a loan arm,
and an investment banking effort, each with its own backend IT system. The bank could
use Hadoop to "get data from all its systems into one system". From there IT could
normalize the raw data and experiment with different methods of analysis, "figuring out
the best way to describe risk".
Software Installation

2.1 VMware Workstation


VMware is an American company that provides cloud and virtualization software and services, and claims to be the first to have successfully virtualized the x86 architecture commercially. Founded in 1998, VMware is based in Palo Alto, California.
VMware's desktop software runs on Microsoft Windows, Linux, and Mac OS, while its enterprise hypervisors for servers run directly on the server hardware without requiring a host operating system.
VMware Workstation is a VMware product launched in 1999. This software suite allows users to run multiple instances of x86 or x86-64 compatible operating systems on a single physical PC. It is a hosted hypervisor that runs on x64 versions of the Windows and Linux operating systems; it enables users to set up virtual machines on a single physical machine and use them simultaneously along with the actual machine. Each virtual machine can execute its own operating system, including versions of Microsoft Windows, Linux and MS-DOS. VMware Workstation is developed and sold by VMware, a division of EMC Corporation. An operating system license is needed to use proprietary ones such as Windows. Ready-made Linux VMs set up for different purposes are available.
VMware Workstation supports bridging existing host network adapters and sharing physical disk drives and USB devices with a virtual machine. It can simulate disk drives; an ISO image file can be mounted as a virtual optical disc drive, and virtual hard disk drives are implemented as .vmdk files.
VMware Workstation Pro can save the state of a virtual machine at any instant. These snapshots can later be restored, effectively returning the virtual machine to the saved state as it was then, free from any post-snapshot damage to the VM.
VMware Workstation includes the ability to designate multiple virtual machines as a team, which can then be powered on, powered off, suspended or resumed as a single object, which is useful for testing client-server environments.

2.1.1 Tools of VMware
VMware Tools, a package with drivers and other software, installs in guest operating systems to increase their performance. It has several components, including the following:
• Drivers for emulated hardware
• VESA-compliant graphics for the guest machine to access high screen resolutions
• Network drivers for the vmxnet2 and vmxnet3 NICs
• Mouse integration
• Drag-and-drop file support between host and guest
• Clipboard sharing between host and guest
• Time synchronization capabilities
• Support for Unity, a feature that allows seamless integration of applications with the host desktop. With Workstation 12, Windows 10 Unity support was added, but Unity is no longer supported for Linux guests.
Third-Party Resources: Ready-to-Use Virtual Machines
Many ready-made virtual machines which run on VMware Player, Workstation, and other virtualization software are available for specific purposes, either for purchase or free of charge; for example, free Linux-based "browser appliances" with Firefox or another browser installed, which can be used for safe web browsing: if infected or damaged, the appliance can be discarded and replaced by a clean copy. The appliance can also be configured to automatically reset itself after each use, so personal information and other changes are not stored. Virtual machines distributed legally only contain freely distributable operating systems, as operating systems on virtual machines must be licensed; ready-to-use Microsoft Windows virtual machines, in particular, are not distributed, except as evaluation versions.

2.2 Ubuntu
Ubuntu is a Debian based Linux operating system and distribution for personal
computers, smart phones and network servers. It uses unity as its default user interface. It

is based on free software and named after Southern African philosophy of Ubuntu, which
Canonical Ltd suggests can be loosely translated as "humanity to others" or "I am what I
am because of who we all are".
A default installation of Ubuntu contains a wide range of software that includes LibreOffice, Firefox, Thunderbird, Transmission and several lightweight games such as Sudoku and chess. Many additional software packages are accessible from the built-in Ubuntu Software Center as well as from any other APT-based package management tool. Some of these packages are no longer available in the default installation.
Ubuntu operates under the GNU General Public License (GPL) and all of the application software installed by default is free software. In addition, Ubuntu installs some hardware drivers that are available only in binary format, but such packages are clearly marked in the restricted component.
Ubuntu's goal is to be secure "out of the box". By default, the user's programs run with low privileges and cannot corrupt the operating system or other users' files. For increased security, the sudo tool is used to assign temporary privileges for performing administrative tasks, which allows the root account to remain locked and helps prevent inexperienced users from inadvertently making catastrophic system changes or opening security holes. PolicyKit is also widely implemented in the desktop to further harden the system. Most network ports are closed by default to prevent hacking. A built-in firewall allows end users who install network servers to control access, and a GUI for Uncomplicated Firewall is available to configure it. Ubuntu also supports full-disk encryption as well as encryption of the home and private directories.
Ubuntu has a certification system for third-party software. Some third-party software that does not limit distribution is included in Ubuntu's multiverse component. The package ubuntu-restricted-extras additionally contains software that may be legally restricted, including support for MP3 and DVD playback, Sun's Java runtime environment, Adobe's Flash Player plugin, many common audio/video codecs, and unrar, an unarchiver for files compressed in the RAR file format.

Additionally, third-party application suites are available for purchase through the Ubuntu Software Center, including many games such as Braid and Oil Rush, software for DVD playback, and media codecs.

2.2.1 Installation of Ubuntu on VMware workstation


Steps for installing Ubuntu in VMware Workstation:
Step1: Open VMware Workstation and click on "Create a new virtual machine".

Figure2.2.1: Ubuntu installation step1


Step2: An installation window will open up; click on "Typical" installation and then click on "next".

Figure2.2.2: Ubuntu installation step 2
Step3: Select the disc image of Ubuntu from the system, load it for installation and then click on "next".

Figure2.2.3: Ubuntu installation step 3


Step4: Enter the user details in the dialog that appears and then click on "next".

Figure2.2.4: Ubuntu installation step 4
Step5: Enter the location where Ubuntu should be installed and then click on "next".

Figure2.2.5: Ubuntu installation step 5

Step6: Select the storage type "Store virtual disk on a single file" and then click on "next".

Figure2.2.6: Ubuntu installation step 6
Step7: An information dialog box will appear; click on "finish" to start the installation.

Figure2.2.7: Ubuntu installation step 7

Step8: The installation process will start and a new window will open up.

Figure2.2.8: Ubuntu installation step 8
Step9: The Ubuntu files will be copied and the installation will proceed.

Figure2.2.9: Ubuntu installation step 9

Step10: After the installation is finished, the window below will appear.

Figure2.2.10: Ubuntu installation step 10
With this, the installation of Ubuntu as a virtual machine using VMware on a real Windows machine is complete.

2.3 Hortonworks Framework


Hortonworks is a business computer software company based in Santa Clara, California. The company focuses on the development and support of Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. Hortonworks was formed in June 2011 as an independent company, funded with $23 million from Yahoo and Benchmark Capital. The company employs contributors to the open-source software project Apache Hadoop.
Hortonworks' product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats. It includes various Apache Hadoop projects including the Hadoop Distributed File System, MapReduce, Pig, Hive, HBase and ZooKeeper, along with additional components.

2.3.1 Hortonworks Sandbox
The Sandbox is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop, specifically the Hortonworks Data Platform (HDP) distribution. The Sandbox comes packaged as a virtual appliance that can run in the cloud or on a personal machine using VMware. The Sandbox allows us to learn and explore HDP.

2.3.2 Importing sandbox in VMware


To start working with the Hortonworks framework, follow these steps:
Step1: VMware should be installed on the system first.
Step2: Double-click on the Hortonworks_Sandbox_2.1.ova file stored on the local disk of the system. It will start importing Hortonworks into VMware.
Step3: After the process ends, a window will appear; click on "Hortonworks_sandbox" on the left-hand side of VMware to open it.

Figure2.3.1: Hortonworks installation

Step4: After the import finishes, power on the framework.

Figure2.3.2: Hortonworks

Hadoop Installation

3.1 Installation in Standalone mode


This mode generally does not require any configuration. It is usually used for debugging purposes, and all of Hadoop's default configuration applies in this mode.
Step1: First, update Ubuntu by running the command "sudo apt-get update" in the terminal.

Figure3.1.1: Update
Step2: Install the default jdk by the command "sudo apt-get install default-jdk"

Figure3.1.2: Install jdk


Step3: check java version installed by the command "java -version"

Figure3.1.3: Java version


Step4: Install ssh localhost by the command "sudo apt-get install ssh "

Figure3.1.4: install ssh
Step5: Check ssh installed or not by typing command "ssh localhost"

Figure3.1.5: ssh localhost


SSH (Secure Shell) allows us to get remote access to any machine (or to localhost) using a password other than the root password, and it also allows us to bypass the password by setting it to empty. So we need to set up SSH for password-less communication.
Step6: To make ssh password less enter the command ssh-keygen -t rsa -P ''

Figure3.1.6: ssh key


Please note that there are two single quotes after '-P' in the command, with no space between them. After entering this command it will ask "Enter file in which to save the key (/home/shubhi/.ssh/id_rsa):"; press Enter without typing anything. You will then see an image, called a randomart image. This image varies from machine to machine, and the generated key will be used for authentication between any two machines. The command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so that the key can be unlocked without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
Step7: Save the generated key by the command "cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys"

Figure3.1.7: save key


Step8: check ssh login without password by the command "ssh localhost"

Figure3.1.8: ssh without password


Step9: Untar the tar file of hadoop on the desktop of Ubuntu and move the file to
/usr/local/hadoop by the command "sudo mv Desktop/hadoop-2.7.2 /usr/local/hadoop"

Figure3.1.9: Move hadoop


Step10: Now we need to set system environment variables so that our system identifies Hadoop. To do this, open the bashrc file as root in any text editor with the command "sudo gedit ~/.bashrc".

Figure3.1.10: open bashrc file
Step11: Append the system environment variables in the end of bashrc file.
#Hadoop variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
#end of Hadoop variable declaration
Line 1: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 sets the Java installation path so that Hadoop can use it wherever required.
Line 2: export HADOOP_INSTALL=/usr/local/hadoop identifies the installed location of Hadoop in the system.
Lines 3 to 8: these are the locations of the Hadoop components; we define them here to reduce our work later, and their use is explained in more depth later.

Figure:3.1.11: Bashrc file


Step12: Apply the bashrc changes permanently with the command "source ~/.bashrc"

Step13: Now check the Hadoop version in the terminal with the command "hadoop version"

Figure3.1.12: Hadoop version


Step14: Update the Java home path (JAVA_HOME) in the /usr/local/hadoop/etc/hadoop/hadoop-env.sh file.

Figure3.1.13: Java_Home
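As a sketch of this step (the path assumes the same OpenJDK 7 location used in the bashrc variables above; it may differ on other systems), the JAVA_HOME line in hadoop-env.sh is changed to point at the JDK explicitly:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64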
With this, Hadoop has been successfully installed in standalone mode on our system, as the quick check below illustrates.
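One quick way to verify the standalone installation is to run one of the example jobs that ship with Hadoop (the jar name below assumes Hadoop 2.7.2; adjust it to the installed version):

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 5

If the job finishes and prints an estimate of pi, the standalone setup is working.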

3.2 Installation of Pseudo Distributed mode of Hadoop


This mode is also called single-node mode. It needs a little configuration and is used for development purposes. Hadoop is configured in standalone mode by default, and that mode is only useful for debugging; to develop any application we need to configure Hadoop in pseudo-distributed mode. To configure Hadoop in pseudo-distributed mode we need to edit the following files:
1)core-site.xml
2)hdfs-site.xml
3)mapred-site.xml
4)yarn-site.xml
All these files are present in "/usr/local/hadoop/etc/hadoop".
Configuring core-site.xml

core-site.xml is the file containing all the core properties of Hadoop, for example the Namenode URL, the temporary storage directory path, etc. Hadoop has predefined configuration values which we can override; if we mention any of these configurations in core-site.xml, then during startup Hadoop will read them and run using these values.
Open the file and append the following lines inside the <configuration></configuration> tag:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/shubhi/tmp</value>
</property>
Property 1: fs.defaultFS
This property overrides the default Namenode URL; its syntax is hdfs://<ip-address of namenode>:<port number>. This property was named fs.default.name in Hadoop 1.x.x versions. Note: the port number can be any unused port; 9000 is commonly used.
Property 2: hadoop.tmp.dir
This property changes the temporary storage directory used during execution of any job in Hadoop; by default its location is "/tmp/hadoop-${user.name}". In this case a directory named tmp has been created in the home folder, so the value is "/home/shubhi/tmp".

Figure3.2.1: core-site.xml
Configuring hdfs-site.xml
This file contains all the configuration for the Hadoop Distributed File System (HDFS), such as the storage location for the Namenode, the storage location for the Datanodes, the replication factor of HDFS, etc.
Open the file and append the following lines inside the <configuration></configuration> tag:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/shubhi/tmp/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/shubhi/tmp/datanode</value>
</property>
Property 1: dfs.replication

This property overrides the replication factor in Hadoop. By default its value is 3, but in a single-node cluster it is recommended to be 1.
Property 2: dfs.namenode.name.dir
This property overrides the storage location of the Namenode data; by default the storage location is inside "/tmp/hadoop-${user.name}". To change this, you have to set the value to your own folder location; in this case it is inside the tmp directory created during the core-site.xml configuration.
Property 3: dfs.datanode.data.dir
This property overrides the storage location of the Datanode data; by default the storage location is inside "/tmp/hadoop-${user.name}". To change this, you have to set the value to your own folder location; in this case it is also inside the tmp directory created during the core-site.xml configuration.

Figure3.2.2: hdfs-site.xml
Configuring mapred-site.xml
This file contains all the configuration for the MapReduce component of Hadoop. Please note that this file doesn't exist by default, but you can copy or rename it from mapred-site.xml.template.
Open the file and append the following lines inside the <configuration></configuration> tag:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
From Hadoop 2.x.x onwards, a new layer of technology was introduced to improve the performance of the MapReduce algorithm; this layer is called "YARN", that is, Yet Another Resource Negotiator. So here we configure our Hadoop framework to be YARN; if we don't specify this property, Hadoop will use MapReduce 1, also called MR1.

Figure3.2.3: mapred-site.xml
Configuring yarn-site.xml
This file contains all the information about YARN; as we will be using MR2, we need to specify the auxiliary services to be used with MR2.
Open the file and append the following lines inside the <configuration></configuration> tag:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Figure3.2.4: yarn-site.xml
Now that all four files have been configured, the next step is to format the Namenode using the command "hdfs namenode -format"

Figure3.2.5: namenode format


The Namenode has been formatted successfully. Now we need to start the services using the commands "start-dfs.sh" and "start-yarn.sh". These two commands will start all Hadoop services on Ubuntu. Alternatively, we can use a deprecated command that starts all Hadoop services at once, i.e. "start-all.sh".

Figure3.2.6: start-yarn.sh

Figure3.2.7: start-all.sh
To check whether all services have started, use the command "jps".

Figure3.2.8: jps

If jps shows all six services (listed below), then Hadoop has been installed in pseudo-distributed mode successfully.
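For reference, with a default pseudo-distributed configuration the six entries reported by jps are typically NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps itself; the exact list can vary with the configuration.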

Installation of Pig and Hive

4.1 Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high-level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs), which the user can write in Java, Python, JavaScript or Ruby and then call directly from the language.
Apache Pig was originally developed at Yahoo Research around 2006 to give researchers an ad-hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.

4.1.1 Pig vs SQL


In comparison to SQL, Pig
• uses lazy evaluation,
• uses extract, transform, load (ETL),
• is able to store data at any point during a pipeline,
• declares execution plans, and
• supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.
On the other hand, it has been argued that DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.
Pig Latin is procedural and fits naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but not what join implementation to use. Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph rather than a pipeline.
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and only then can the cleansing and transformation process begin. A short Pig Latin sketch follows.
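To illustrate the pipeline style, here is a short Pig Latin word-count sketch (illustrative only; the input and output paths and the relation names are assumptions, not taken from the report):

lines = LOAD '/user/flume/tweets' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/user/output/wordcount';

Each statement names a relation, and later statements can branch off any earlier one, which is exactly the DAG-style workflow described above.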

4.2 Installation of Pig


Step1: Untar the Pig archive on the desktop and move it to "/usr/lib/pig"

Figure4.2.1: move pig


Step2: Edit bashrc file

Figure4.2.2: pig path
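The exact lines added to the bashrc file are shown in the figure; a typical sketch (assuming Pig was moved to /usr/lib/pig; adjust to the actual install path) is:

#Pig variables
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin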

Step3: permanently save Bashrc file using the command "source ~/.bashrc"
Step4: Now open the "grunt>" shell of pig.
1) Local mode- "pig -x local"

Figure4.2.3: local mode
2) MapReduce Mode- "pig"

Figure4.2.4: MapReduce mode

4.3 Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop HDFS and compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL with schema-on-read and transparently converts queries to MapReduce, Apache Tez or Spark jobs. All three execution engines can run on Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.
By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
Four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC and RCFILE.

4.3.1 HiveQL
While based on SQL, HiveQL does not follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but only offers basic support for indexes. HiveQL also lacks support for transactions and materialized views, and has only limited subquery support. Support for INSERT, UPDATE and DELETE with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez or Spark jobs, which are submitted to Hadoop for execution.
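As a small illustration of the CREATE TABLE AS SELECT extension mentioned above (the table popular_tweets is hypothetical; load_tweets is the tweets table created later in Chapter 5):

CREATE TABLE popular_tweets AS SELECT id, text FROM load_tweets WHERE text LIKE '%hadoop%';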

4.4 Hive installation


To install Hive, follow these steps.
Step1: Untar the Hive tar file on the desktop.
Step2: Move the extracted folder from the desktop to /usr/lib/hive using the command "sudo mv apache-hive-2.0.1-bin /usr/lib/hive"

Figure4.4.1: move hive
Step3: Edit the bashrc file to add the environment variables for Hive, as sketched below

Figure4.4.2: bashrc file
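A typical sketch of the lines added here (assuming Hive was moved to /usr/lib/hive as in the previous step) is:

#Hive variables
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin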


Step4: Apply the bashrc changes permanently with the command "source ~/.bashrc"
Step5: Initialize the metastore with schematool. If it completes successfully, Hive is installed and you can go directly to the Hive terminal; otherwise, move the old metastore directory and run schematool again. If it still fails, stop all Hadoop services, run schematool, move the metastore, and then start all the services again.
Type hive on the terminal to enter the Hive shell.
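A minimal sketch of the schematool invocation for the default embedded Derby metastore (assuming the Hive 2.x installation above) is:

schematool -dbType derby -initSchema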

Figure4.4.3: schematool failed

Figure4.4.4: schematool completed

Figure4.4.5: hive shell

Sentiment Analysis in Hive
5.1 Adding the SerDe
As the tweets coming in from Twitter are in JSON format, we need to load the tweets into Hive using a JSON input format. We will use the Cloudera Hive JSON SerDe (hive-serde-1.0-SNAPSHOT.jar) for this purpose.

ADD jar 'path of the jar file';
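For example (the path shown is an assumption; use the location where the SerDe jar was actually copied):

ADD JAR /usr/lib/hive/lib/hive-serde-1.0-SNAPSHOT.jar;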

5.2 Analysis Part


For performing Sentiment Analysis, we need the tweet_id and the tweet text, so we will create a Hive table that extracts the id and tweet text from the tweets using the Cloudera JSON SerDe.

Our tweets are stored in the '/user/flume/tweets/' directory of HDFS.

5.3 Creating External Table for Tweets Storage


We create an external table in Hive in the same directory where our tweets are present, i.e. '/user/flume/tweets/', so that the tweets present in this location are automatically available in the Hive table.

The command for creating a Hive table to store the id and text of the tweets is as follows:

create external table load_tweets(id BIGINT, text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/flume/tweets';
Next, we will split the text into words using the split() UDF available in Hive. The split() function returns an array of values, so we will create another Hive table and store the tweet_id and the array of words.

create table split_words as select id as id, split(text,' ') as words from load_tweets;

select * from split_words;

Next, we split each word inside the array into a new row. For this we need a UDTF (User Defined Table Generating Function). Hive has a built-in UDTF called explode, which extracts each element from an array and creates a new row for it.

create table tweet_word as select id as id, word from split_words LATERAL VIEW explode(words) w as word;

We use a dictionary called AFINN to calculate the sentiments; AFINN assigns each word a rating between -5 and +5. First, we create a table and load the contents of the AFINN dictionary into it.

create table dictionary(word string, rating int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/AFINN.txt' into TABLE dictionary;

Next, we join the tweet_word table and the dictionary table so that the rating of each word is attached to the word.

create table word_join as select tweet_word.id, tweet_word.word, dictionary.rating from tweet_word LEFT OUTER JOIN dictionary ON (tweet_word.word = dictionary.word);

Finally, we perform a GROUP BY operation on the tweet_id so that all the words of one tweet come together, and then take the average of the ratings of the words of each tweet, so that the average rating of each tweet is found.

select id, AVG(rating) as rating from word_join GROUP BY word_join.id order by rating DESC;
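If a labelled result is wanted, one possible follow-up (not part of the original queries; the table name sentiment_result is hypothetical) is to classify each tweet from its average rating:

create table sentiment_result as select id, AVG(rating) as rating, CASE WHEN AVG(rating) > 0 THEN 'positive' WHEN AVG(rating) < 0 THEN 'negative' ELSE 'neutral' END as sentiment from word_join GROUP BY word_join.id;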

5.4 Final Result
