Big Data

Thouraya Hadj Hassen, 3SIC

1. Creating a VM:

2. Accessing the VM terminal on the port 4200:

3. Accessing the Ambari dashboard after resetting the login and password to admin/admin:

4. Creating the data folder in /tmp/ and adding the datasets “truck.csv” and “geolocation.csv”:

5. Entering the Hive interface and adding the

We have to tick the “Is first row header?” option so that the first row of the file is read as column names rather than as a data row.
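
The effect of this option can be illustrated outside Hive. The following is a minimal plain-Python sketch, not the Hive upload wizard itself. The column names and the two data rows are hypothetical and only stand in for the real contents of “geolocation.csv”.

```python
import csv
import io

# Hypothetical sample standing in for geolocation.csv (not the real data).
SAMPLE = "truckid,driverid,event\nA54,A54D,normal\nA20,A20D,overspeed\n"


def read_rows(text, first_row_is_header):
    """Return rows as dicts, treating the first line as a header or as data."""
    lines = io.StringIO(text)
    if first_row_is_header:
        # Column names come from the first line; it is not counted as data.
        return list(csv.DictReader(lines))
    # Without the option, columns get generic names and the header becomes a row.
    rows = list(csv.reader(lines))
    names = [f"column{i + 1}" for i in range(len(rows[0]))] if rows else []
    return [dict(zip(names, row)) for row in rows]


with_header = read_rows(SAMPLE, first_row_is_header=True)
without_header = read_rows(SAMPLE, first_row_is_header=False)
print(len(with_header), with_header[0])
print(len(without_header), without_header[0])
```

With the option unticked, the header line ends up as an extra data row, which is why queries on the table would return a bogus first record.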

This is a preview of each column and its associated type:

(We do the same with “truck.csv”.)


6. Sample Data from the trucks table:

7. Beeline - Command Shell: Connect to Beeline Hive and enter the Beeline commands to grant all permissions to the “admin” user:

Enter the Beeline commands to view 10 rows from the customer and account tables of the foodmart database:

8. Create Table truckmileage From Existing Trucking Data:

9. Explore a sampling of the data in the truckmileage table:


Saving the query used as “average-mpg”:

10. Explore Explain Features of the Hive Query Editor:

11. Create Table avgmileage From Existing trucks_mileage Data:
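
The aggregation behind such a table can be sketched in plain Python. This is only an illustration of a per-truck average and not the Hive query used in the lab. The record layout (truck id, miles per gallon) and the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical (truckid, mpg) records; the real truckmileage columns may differ.
truck_mileage = [("A1", 4.0), ("A1", 6.0), ("A2", 5.5)]


def average_mpg(records):
    """Group records by truck id and return the mean mpg for each truck."""
    totals = defaultdict(lambda: [0.0, 0])  # truckid -> [sum of mpg, row count]
    for truckid, mpg in records:
        totals[truckid][0] += mpg
        totals[truckid][1] += 1
    return {truckid: total / count for truckid, (total, count) in totals.items()}


avg_mileage = average_mpg(truck_mileage)
print(avg_mileage)
```
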
12. Create Table DriverMileage from Existing truckmileage Data and view Sample Data of avgmileage (the screenshot of the creation of “drivemillage” is missing):

13. Exporting drivemillage to a .csv file:

14. Open the Zeppelin interface using its URL, then create a Spark2 notebook and initiate the instance:

15. Read CSV Files into Apache Spark:

16. Import CSV data into a data frame with a user-defined schema:
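
The idea of a user-defined schema is that every column gets an explicit name and type instead of being inferred. A minimal plain-Python sketch of that idea follows. It does not use Spark, and the schema and sample rows are hypothetical, so the names and types used in the lab may differ.

```python
import csv
import io

# Hypothetical schema and data; the names and types used in the lab may differ.
SCHEMA = [("driverid", str), ("totmiles", int), ("avgmpg", float)]
RAW = "A1,1200,4.8\nA2,950,5.1\n"


def load_with_schema(text, schema):
    """Parse header-less CSV text, naming and casting each field per the schema."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(record)}")
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


typed_rows = load_with_schema(RAW, SCHEMA)
print(typed_rows)
```

Declaring the types up front means a malformed value fails at load time instead of silently turning a numeric column into text.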
17. Query Tables To Build Spark RDD:

18. Querying Against Registered Temporary Tables:

19. Perform join Operation:
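
As an illustration of what the join does, here is a small plain-Python sketch of an inner join on a shared key. It is not the Spark code from the notebook. The key name `driverid` and both sample tables are assumptions made for the example.

```python
# Hypothetical per-driver rows; the real tables and column names may differ.
events = [{"driverid": "A1", "events": 3}, {"driverid": "A3", "events": 1}]
mileage = [{"driverid": "A1", "totmiles": 1200}, {"driverid": "A2", "totmiles": 950}]


def inner_join(left, right, key):
    """Return merged rows for every pair of left/right rows sharing the key."""
    index = {}
    for row in right:
        # Index the right-hand rows by key so each lookup is a dict access.
        index.setdefault(row[key], []).append(row)
    return [{**l_row, **r_row} for l_row in left for r_row in index.get(l_row[key], [])]


joined = inner_join(events, mileage, "driverid")
print(joined)
```

Only driver A1 appears in both inputs, so the result has a single row. Drivers present on one side only are dropped, as in any inner join.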

20. Compute Driver Risk Factor:
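
The formula used in the notebook is not reproduced in this report. The sketch below therefore assumes, purely for illustration, that the risk factor is the number of abnormal events per mile driven. The function name, that definition, and the sample figures are all hypothetical.

```python
def risk_factor(abnormal_events, total_miles):
    """Assumed definition: abnormal events per mile driven (illustrative only)."""
    if total_miles <= 0:
        raise ValueError("total_miles must be positive")
    return abnormal_events / total_miles


# Hypothetical (driverid, abnormal events, total miles) figures.
drivers = [("A1", 3, 1200), ("A2", 1, 1000)]

# Rank drivers from highest to lowest assumed risk.
ranked = sorted(
    ((driverid, risk_factor(n_events, miles)) for driverid, n_events, miles in drivers),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```
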

21. Data Reporting With Zeppelin: Import the Data and Visualize final results Data in Tabular Format:

22. Build Charts using Zeppelin:

