
BIG DATA ANALYSIS

LAB FILE

Experiment-1

Objective- Design a Word Count application using the MapReduce programming model.

Theory- In the MapReduce word count example, we find the frequency of each word in a text file. The
role of the Mapper is to emit each word as a key paired with the value 1, and the role of the Reducer is to
aggregate the values of each common key into a total count. So, everything is represented in the form of
key-value pairs; for example, for the input line "big data big" the Mapper emits (big,1), (data,1), (big,1)
and the Reducer outputs (big,2) and (data,1).

Procedure-
1. Copy a local text file into HDFS.

2. Run the Word Count program from the MapReduce examples.

3. Run word count on the given text file and specify an output directory for the results.

4. Once the job completes, list the files in HDFS to locate the output directory.

5. Open the output directory to reveal a _SUCCESS marker and the output file.

6. Copy that output file to the local filesystem and give it a new name.

7. Use more to view the results.
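
A minimal sketch of the terminal commands for steps 1 to 7 is given below; the file names input.txt, wcout and wcresult.txt, the user directory, and the examples jar path are assumptions that will differ between installations.

    # step 1: copy the local file into HDFS
    hadoop fs -put input.txt /user/cloudera/input.txt
    # steps 2-3: run the prebuilt word count job, writing to the wcout directory in HDFS
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/input.txt /user/cloudera/wcout
    # steps 4-5: list the output directory; it contains _SUCCESS and a part file with the counts
    hadoop fs -ls /user/cloudera/wcout
    # step 6: copy the part file to the local filesystem under a new name
    hadoop fs -get /user/cloudera/wcout/part-r-00000 wcresult.txt
    # step 7: view the word counts
    more wcresult.txt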

Result-
Conclusion- The word count application was successfully run via MapReduce and the output is
available.

Experiment-2
Objective- Create a Hive table and perform alterations on it.

Theory- Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. Hive allows users to read, write, and manage petabytes of data using
SQL. Hive is built on top of Apache Hadoop, which is an open-source framework used to
efficiently store and process large datasets.

Procedure-

1. Type hive in the terminal to initialize the Hive shell.

2. Create a database, view the databases, and set the new one as the current database.

3. Create a separate CSV file in Documents to use as the data source.

4. Open another terminal and open the CSV file there to view its contents.

5. Use pwd to get its path and gedit to edit its contents.

6. Now, in the Hive shell, create the employee table and type out its schema.

7. Use the path of the CSV file from the other terminal and load it into the table.

8. SQL-style operations can now be performed on the table.
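
A minimal sketch of the Hive shell statements for these steps; the database name empdb, the employee schema, and the CSV path are assumptions chosen for illustration.

    CREATE DATABASE empdb;
    SHOW DATABASES;
    USE empdb;
    -- schema assumed to match the columns of the sample CSV file
    CREATE TABLE employee (id INT, name STRING, dept STRING, salary FLOAT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
    -- path taken from pwd in the other terminal
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/employee.csv' INTO TABLE employee;
    SELECT * FROM employee;
    -- one example alteration, matching the objective
    ALTER TABLE employee ADD COLUMNS (city STRING);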

Results-

Operations on the created table in Hive.

Conclusion- Table successfully created in Hive, data was loaded into it, and queries were performed.

Experiment-3
Objective- Joining two datasets with a common column using Hive.

Procedure-

1. Create 2 CSV files.

2. Open an existing database.

3. Create the 2 table schemas and load data into them from the CSV files.

4. Join the two tables on the common column.

5. Create one table as an external table, so that dropping the table does not delete the underlying data files.
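
A sketch of these steps in HiveQL, assuming two illustrative tables emp and dept that share a deptid column; the names, schemas, and paths are assumptions, not the original files.

    USE empdb;
    CREATE TABLE dept (deptid INT, deptname STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    -- external table: dropping it removes only the metadata, the data files stay in HDFS
    CREATE EXTERNAL TABLE emp (empid INT, empname STRING, deptid INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/cloudera/empdata';
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/dept.csv' INTO TABLE dept;
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/emp.csv' INTO TABLE emp;
    -- join the two tables on the common deptid column
    SELECT e.empid, e.empname, d.deptname
      FROM emp e JOIN dept d ON (e.deptid = d.deptid);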

Results-
The two tables-

The resultant joined table-

Experiment-4
Objective- Performing Partitioning via Hive.
Theory-
Hive partitioning is of two types – 1. Static or manual partitioning
2. Dynamic partitioning.
Static Partitioning – When you know which partition each data file belongs to, you arrange the files
into that partition of the table yourself; this partitioning has to be done manually.
Dynamic Partitioning – Hive performs the partitioning for you, based on one or more columns of your data.
Procedure and Result-
STATIC PARTITIONING-
1. Create two CSV files, one for students studying Python and the other for students studying Hadoop.

2. Create a new database and use it.

3. Create a new table using the PARTITIONED BY clause.

4. Load the two CSV files into the table, specifying the year and course partition values for each.

5. Check the Hue browser for the static partitioning result.
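
A sketch of the static-partitioning statements, where the student schema, the course and year values, and the CSV paths are assumptions used only for illustration.

    CREATE DATABASE studentdb;
    USE studentdb;
    -- partition columns are declared in PARTITIONED BY, not in the main column list
    CREATE TABLE student (id INT, name STRING)
      PARTITIONED BY (course STRING, year INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    -- each LOAD names its partition explicitly, which is what makes this static
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/python.csv'
      INTO TABLE student PARTITION (course='python', year=2023);
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/hadoop.csv'
      INTO TABLE student PARTITION (course='hadoop', year=2023);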

DYNAMIC PARTITIONING-
1. Create a new CSV file; Hive will handle the partitioning itself in dynamic mode.

2. Create a new database.

3. Set the Hive properties required for dynamic partitioning.

4. Create a temporary table to store the initial data.

5. Load the local CSV file into the temporary table.

6. Create a table for partitioning.

7. Transfer the data from the temporary table to the partitioned table.

8. Check the partitioned files based on course and year in the Hue browser.
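
A sketch of the dynamic-partitioning statements for these steps; the database, the staging and partitioned table names, and the CSV path are assumptions.

    CREATE DATABASE studentdyn;
    USE studentdyn;
    -- enable dynamic partitioning for this session
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    -- temporary (staging) table holding the raw CSV data
    CREATE TABLE student_stage (id INT, name STRING, course STRING, year INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA LOCAL INPATH '/home/cloudera/Documents/students.csv' INTO TABLE student_stage;
    -- partitioned target table
    CREATE TABLE student_part (id INT, name STRING)
      PARTITIONED BY (course STRING, year INT);
    -- Hive derives the partition values from the last columns of the SELECT
    INSERT INTO TABLE student_part PARTITION (course, year)
      SELECT id, name, course, year FROM student_stage;
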
Experiment-5

Objective- Running Pig and executing queries.

Procedure and Results-

1.
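
A minimal illustrative sketch of a Pig Grunt-shell session of the kind this objective describes; the student.csv file, its schema, and the query are assumptions, not the commands actually run in the lab.

    # copy the sample data into HDFS and start the Grunt shell
    hdfs dfs -put student.csv /user/cloudera/student.csv
    pig
    grunt> students = LOAD '/user/cloudera/student.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
    grunt> passed = FILTER students BY marks >= 40;
    grunt> DUMP passed;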
