Big Data Analysis Lab File: Objective-Design A Word Count Application Using Mapreduce Programming Model. Theory
LAB FILE
Experiment-1
Objective- Design a word count application using the MapReduce programming model.
Theory- In the MapReduce word count example, we find the frequency of each word in a text. The
role of the Mapper is to emit a key-value pair (word, 1) for every word it reads, and the role of the
Reducer is to aggregate the values of each common key, summing them into the final count. So,
everything in the model is represented in the form of key-value pairs.
Procedure-
1. Copy a local text file to HDFS.
2. Run the word count job on the given text file, specifying an output directory for it.
3. Then open the output directory to reveal the _SUCCESS marker file and the output file.
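The steps above can be sketched with the Hadoop CLI. The local file name, HDFS paths, and the location of the examples jar are assumptions and will vary by installation:

```shell
# Assumed local file and HDFS paths -- adjust for your cluster.
hadoop fs -mkdir -p /user/student/wordcount
hadoop fs -put input.txt /user/student/wordcount/input.txt

# Run the stock word count job shipped with Hadoop (jar path varies by version).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/student/wordcount/input.txt /user/student/wordcount/output

# List the output directory: _SUCCESS marks completion, part-r-00000 holds the counts.
hadoop fs -ls /user/student/wordcount/output
hadoop fs -cat /user/student/wordcount/output/part-r-00000
```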
Results-
Conclusion- The word count application was run successfully via MapReduce and its output is
available.
Experiment-2
Objective- Hive table creation and alteration.
Theory- Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. Hive allows users to read, write, and manage petabytes of data using
SQL. Hive is built on top of Apache Hadoop, which is an open-source framework used to
efficiently store and process large datasets.
Procedure-
1. Create a database, view it, and finally set it as the default.
2. Open another terminal and open the CSV file there to view its contents.
3. Use pwd to get its path and gedit to edit its contents.
4. Now, in the Hive shell, create the employee table and type out its schema.
5. Using the path of the CSV file from the other terminal, load it into the table here.
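A minimal HiveQL sketch of the steps above; the database name, table name, column schema, and CSV path are assumptions for illustration:

```sql
-- Create and select the database (names are assumed).
CREATE DATABASE IF NOT EXISTS lab;
SHOW DATABASES;
USE lab;

-- An assumed schema for the employee table; fields match a comma-separated CSV.
CREATE TABLE employee (
    id      INT,
    name    STRING,
    salary  DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the CSV using the path found with pwd in the other terminal.
LOAD DATA LOCAL INPATH '/home/cloudera/employee.csv' INTO TABLE employee;

SELECT * FROM employee LIMIT 5;
```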
Results-
Conclusion-Table sucessfully created in hive, data was added to it and queries performed.
Experiment-3
Objective-Joining two datasets with a common column using Hive.
Procedure-
1. Create two table schemas and load data into them from the CSV files.
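Joining on a shared column can be sketched in HiveQL; the table names, columns, and join key below are assumptions:

```sql
-- Two assumed tables sharing the common column emp_id.
CREATE TABLE employee (
    emp_id INT,
    name   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE TABLE department (
    emp_id    INT,
    dept_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Inner join on the common column.
SELECT e.emp_id, e.name, d.dept_name
FROM employee e
JOIN department d ON e.emp_id = d.emp_id;
```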
Results-
The two tables-
Experiment-4
Objective- Performing partitioning via Hive.
Theory-
Hive partitioning – 1. Static or manual Partitioning
2. Dynamic Partitioning.
Static Partitioning – When you know which partition each data file belongs to, you arrange the files
in the table accordingly. You have to create these partitions manually.
Dynamic Partitioning – Hive will do partitioning based on one or multiple columns of your data.
Procedure and Result-
STATIC PARTITIONING-
1. Create two CSV files, one for students studying Python and the other for students studying Hadoop.
2. Load the two CSV files into the table, using a partition of year and course for both.
3. Check the Hue browser for the static partitioning result.
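A HiveQL sketch of the static case; the table, columns, year values, and CSV paths are assumptions:

```sql
-- Partition columns (course, year) are declared separately from the data columns.
CREATE TABLE student (
    id   INT,
    name STRING
)
PARTITIONED BY (course STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Each file is loaded into the partition we already know it belongs to.
LOAD DATA LOCAL INPATH '/home/cloudera/python_students.csv'
INTO TABLE student PARTITION (course = 'python', year = 2023);

LOAD DATA LOCAL INPATH '/home/cloudera/hadoop_students.csv'
INTO TABLE student PARTITION (course = 'hadoop', year = 2023);

SHOW PARTITIONS student;
```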
DYNAMIC PARTITIONING-
1. Create a new CSV file; Hive will handle everything itself in dynamic partitioning.
2. Check the partitioned files, organized by course and year, in the Hue browser.
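The dynamic case can be sketched as below; the staging table, column names, and CSV path are assumptions:

```sql
-- Settings required before dynamic partitioning can be used.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- A plain staging table holding the raw CSV, including course and year columns.
CREATE TABLE student_stage (
    id     INT,
    name   STRING,
    course STRING,
    year   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/cloudera/students.csv' INTO TABLE student_stage;

-- Hive derives each row's partition from the last two selected columns.
CREATE TABLE student_part (
    id   INT,
    name STRING
)
PARTITIONED BY (course STRING, year INT);

INSERT OVERWRITE TABLE student_part PARTITION (course, year)
SELECT id, name, course, year FROM student_stage;

SHOW PARTITIONS student_part;
```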
Experiment-5
1.