CDOSS Certificate
Managing Big Data in Hadoop Cluster
© Dr Heni Bouhamed
Big Data Trainer
Senior Lecturer at Sfax University
Senior Lecturer at ESTYA University (France)
Cloudera Instructor at Elitech Paris
Certified CDOSS Big Data Instructor
Heni.bouhamed@fsegs.usf.tn
• Browsing Files:
28/04/2023
Q1: Use the hive shell or beeline to view the tables in the default
database. Which of the following tables are in the default database?
Check all that apply.
• crayons
• customers
• employees
• games
• inventory
• makers
• offices
• orders
• salary_grades
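For reference, listing the tables can be sketched as follows (run inside the hive shell or beeline):

```sql
-- Switch to the default database and list its tables:
USE default;
SHOW TABLES;
```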
Q2: Use hive shell or beeline to view the columns in the orders table in the default database. Which of
the following columns are in the orders table? Check all that apply.
• Country
• cust_id
• empl_id
• name
• office_id
• order_id
• total
Q3: Once you're in hive shell or beeline, one way to get the schema of the makers table is to use:
• Select the toy database with use toy; and then run the command SHOW DATABASE;
• Select the toy database with use toy; and then run the command DESCRIBE TABLES;
• Select the toy database with use toy; and then run the command SHOW TABLES;
• Select the toy database with use toy; and then run the command DESCRIBE makers;
• Select the toy database with use toy; and then run the command SHOW TABLE makers;
Q4: By default, what directory in HDFS would store the data for a table
named characters in a database named muppets?
• /user/hive/warehouse/muppets.db/characters
• /muppets.db/characters
• /user/hive/warehouse/characters
• /muppets/characters
• /user/hive/warehouse/muppets/characters
Q5: By default, what directory in HDFS would store the data for a table
named incidents in the default database?
• /default.db/incidents
• /user/hive/warehouse/incidents
• /user/hive/warehouse/default/incidents
• /user/hive/warehouse/default.db/incidents
• /default/incidents
• comma (,)
• tab
Q7: Which command will print the contents of a file in HDFS to the terminal screen?
Non-strict coupling: in Hive you can define several external tables over the same file (for example, with different delimiters).
- DROP DATABASE [IF EXISTS] db_name; if the database still contains tables, use DROP DATABASE db_name CASCADE;
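As a sketch of the points above (paths, table names, and columns are illustrative), two external tables can be defined over the same HDFS directory, and a database that still contains tables must be dropped with CASCADE:

```sql
-- Two external tables over the same files, with different schemas:
CREATE EXTERNAL TABLE sales_csv (id INT, total STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/sales';

CREATE EXTERNAL TABLE sales_raw (line STRING)
  LOCATION '/data/sales';

-- Dropping a database that still contains tables:
DROP DATABASE IF EXISTS db_name CASCADE;
```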
• Modifications: ALTER TABLE tab_name action settings
- Rename a table: ALTER TABLE customers RENAME TO clients; (internally, the table's folder is renamed)
- Move a table to another database: ALTER TABLE default.clients RENAME TO dig.clients;
(for an internal table the folder is moved; for an external table only the metadata moves)
- Change a column's name and/or data type (both can be changed at once):
ALTER TABLE employees CHANGE first_name fname STRING;
ALTER TABLE employees CHANGE salary salary BIGINT;
- Change the column order:
ALTER TABLE employees CHANGE salary salary INT AFTER office_id;
ALTER TABLE employees CHANGE salary salary INT FIRST;
(changing the order does not change the stored data)
- Replace all columns: ALTER TABLE tab_name REPLACE COLUMNS (col1 type1, …);
- Change an internal table to an external one:
ALTER TABLE tab_name SET TBLPROPERTIES('EXTERNAL'='TRUE'); II.8*
Q3 Which statement correctly sets the file type for newtable as a comma-delimited text file?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT '.csv';
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\c';
• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT CSV;
Q4 Which statement correctly uses a SerDe for reading and writing newtable's data?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE org.apache.hadoop.hive.serde2.OpenCSVSerde;
Q5 Which statement shows correct syntax for skipping a header line in each file of a table's data?
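For reference, Hive skips header lines through a table property; a sketch (table and column names are illustrative):

```sql
CREATE TABLE newtable (col1 STRING, col2 INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  TBLPROPERTIES ('skip.header.line.count'='1');
```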
Data Type
• Examining data types: DESCRIBE FORMATTED table_name; shows each column's type and further details of the table.
Out-of-range values:
- DECIMAL: NULL
- FLOAT and DOUBLE: -Infinity or Infinity
- INT: NULL
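The FLOAT/DOUBLE behavior mirrors IEEE-754 arithmetic, where overflow produces an infinity rather than an error; a minimal sketch in Python:

```python
# Doubles top out around 1.8e308; going past that yields +/-Infinity,
# which is what Hive returns for out-of-range FLOAT/DOUBLE values.
too_big = 1e308 * 10
print(too_big)        # inf
print(-1e308 * 10)    # -inf
```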
II.9*
File types: TEXTFILE, Avro, Parquet, ORC
- Parquet: excellent interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible
- ORC: low interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible
- SequenceFile: tied to Java and Hadoop; good performance but poor interoperability
- RCFile: columnar text file; low interoperability, low performance (the original version of ORC)
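The file format is chosen per table with the STORED AS clause; a sketch (table and column names are illustrative):

```sql
CREATE TABLE events_parquet (id INT, name STRING)
  STORED AS PARQUET;

CREATE TABLE events_orc (id INT, name STRING)
  STORED AS ORC;
```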
• Choice of format
Q1 Which of the following are integer types used by Hive? Check all
that apply.
• LARGEINT
• TINYINT
• BIGINT
• BYTEINT
• INT
Q2 Which of these data types has the greatest range of possible values?
(Note that this question is asking about range, not precision.)
• DOUBLE
• FLOAT
• DECIMAL
• CHAR(2)
• VARCHAR(2)
• STRING
Q4 Suppose you have data that looks similar to the following: an ID code (always containing
digits with a non-zero leading digit), a name, an order code (could contain letters or digits),
and a total amount in US dollars and cents (seven digits or less).
83928,Carey Myagi,C83820K,28.09
Which of the following would be the best choice of data types for these four columns (in
the same order)?
• TINYINT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, FLOAT
• TINYINT, STRING, STRING, DECIMAL(7,2)
• INT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, FLOAT
• INT, STRING, STRING, DECIMAL(7,2)
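The dollar-amount column illustrates why DECIMAL is the safer choice over FLOAT for money: binary floats cannot represent most decimal fractions exactly. A minimal sketch in Python, whose float is a double like Hive's DOUBLE:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly:
print(0.1 + 0.2 == 0.3)                                    # False
# Exact decimal arithmetic, as with Hive's DECIMAL type:
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
```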
• Use the data source (left assist) panel in Hue to see the table schema
• Use DESCRIBE SCHEMA table_name; to see the table schema
• Use the Table Browser in Hue to see the table schema
• Use SHOW CREATE TABLE table_name; to see the table schema
• Use DESCRIBE table_name; to see the table schema
• Use SHOW TYPE column_name FROM table_name; to get the type of column_name
• Use SELECT typeof(column_name) FROM table_name; to get the type of column_name
(Impala only)
• Efficient storage
Q7 Which are important considerations when choosing a file type for storing data?
• Data size and query performance
• Ingest pattern
• Interoperability needs
• Loading data from the command line
We have already seen: hdfs dfs -ls, hdfs dfs -cat, hdfs dfs -get (HDFS to local)
Exercise:
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/ancient_games.csv
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -cp /old/games/ancient_games.csv /user/hive/warehouse/fun.db/games/
(omit the target file name to keep the same name)
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/* (to delete everything in the directory)
hdfs dfs -rm -r /path (to delete a folder)
hdfs dfs -mv /path… /path… (to move)
hdfs dfs -put /home/training/training_materials/analyst/data/games.csv /user/hive/warehouse/fun.db/games/
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -mkdir /path/dir (-p to create the parent folder(s))
• Loading data with Sqoop (RDBMS to Hadoop and the reverse)
Example: sqoop import --connect jdbc:mysql://localhost/company --username training --password training --table customers
• Loading data with Hive (INSERT, LOAD DATA)
Advantages: easier loading with SQL; data processing is possible during loading.
I/ LOAD DATA: files are added to any that already exist; on a name collision, the new files are renamed.
To overwrite what already exists, use OVERWRITE:
LOAD DATA INPATH '/path/dir/file.ext' OVERWRITE INTO TABLE table_name;
• Loading data with Hive
II/ INSERT:
INSERT INTO TABLE table_name VALUES (value1, value2, …), (valueN, valueN+1, …); adds new files without deleting the old ones.
Adding small fragments of data (<64 MB) creates many small files, which makes processing more cumbersome.
INSERT OVERWRITE TABLE table_name SELECT * FROM tablex; (loads all the data, replacing what was there)
III/ CTAS (Create Table As Select):
CREATE TABLE chicago_emp ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
AS SELECT * FROM employees WHERE office_id = 'b';
(the ROW FORMAT clause is optional)
Q1 Which hdfs dfs subcommand can be used to place a file that's already in
HDFS into a different directory, without also leaving it in the original
directory?
• -cp
• -get
• -ls
• -mv
• -put
• -rm
• NULL
• \N
Q3: This command uses Sqoop to import data from MySQL to HDFS on the VM (with the user root).
$ sqoop import --connect jdbc:mysql://localhost/mydb \
--username root --password hadoop \
--table example_table
Where in HDFS will the data be found?
• /user/hive/warehouse/mydb.db/example_table
• /user/training/example_table
• /user/training/mydb.db/example_table
• /user/hive/warehouse/example_table
• You can use one command to import all tables in a database to HDFS
Q5 Which are valid commands for loading data into tables using Hive?
• LOAD DATA INPATH '/path/to/file.ext' INTO TABLE table;
• Small files can't be compressed much, so they take too much space to
store
• Small files slow the system down more than (fewer) big files holding the
same amount of data
• Small files don't have enough data for the big data processes to work
with
Q7 It's possible to create a table and then populate it with data from
another table using INSERT...SELECT. You can also perform these two
steps with a single command. What is that technique called?
• Simple cloning
• CTAS
• Modified cloning
• CREATE INSERT
• CREATE SELECT
• CTIS
• Views
- Simplify access to tables for frequent queries
- Restrict access to certain columns (security)
- Views are not materialized (Hive supports materialized views starting from version 3.0)
II.14*
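A sketch of a view serving both purposes above (table and column names are illustrative):

```sql
-- Expose only non-sensitive columns of employees:
CREATE VIEW employees_public AS
  SELECT empl_id, fname, office_id FROM employees;
```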
• Partitioning is appropriate when:
* the data is very large and queries are too slow
* the data is already stored in folders according to the partition column (without that column appearing in the data itself)
* most queries filter on one or two columns
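A partitioned table can be sketched as follows (names are illustrative); each country value gets its own subfolder under the table's directory:

```sql
CREATE TABLE orders_part (order_id INT, total DECIMAL(7,2))
  PARTITIONED BY (country STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```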
• ARRAY
• STRUCT
• MAP
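The types listed above are Hive's complex types; a table using all three can be sketched as follows (names are illustrative):

```sql
CREATE TABLE people (
  name    STRING,
  phones  ARRAY<STRING>,                       -- ordered list of values
  address STRUCT<street:STRING, city:STRING>,  -- named fields
  props   MAP<STRING,STRING>                   -- key/value pairs
);
```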
• They store data in files that you can list and directly access
• They abstract away the details of how the data is stored in files using specific file formats
• They are the systems most often used to store data for Hive