CDOSS Certificate
Managing Big Data in Hadoop Cluster
© Dr Heni Bouhamed
Big Data Trainer
Senior Lecturer at Sfax University
Senior Lecturer at ESTYA University (France)
Cloudera Instructor at Elitech Paris
Certified CDOSS Big Data Instructor
Heni.bouhamed@fsegs.usf.tn


• Browsing files (command line):

hdfs dfs -ls /path                             (list a directory)
hdfs dfs -cat /path                            (print a file's contents)
hdfs dfs -get /path/to/hdfs /path/to/local     (copy from HDFS to the local file system)


Q1: Use the hive shell or beeline to view the tables in the default
database. Which of the following tables are in the default database?
Check all that apply.
• crayons
• customers
• employees
• games
• inventory
• makers
• offices
• orders
• salary_grades

Q2: Use the hive shell or beeline to view the columns in the orders table in the default
database. Which of the following columns are in the orders table? Check all that apply.

• Country

• cust_id

• empl_id

• name

• office_id

• order_id

• total


Q3: Once you're in the hive shell or beeline, one way to get the schema of
the makers table is to:

• Select the toy database with use toy; and then run the command
SHOW DATABASE;
• Select the toy database with use toy; and then run the command
DESCRIBE TABLES;
• Select the toy database with use toy; and then run the command
SHOW TABLES;
• Select the toy database with use toy; and then run the command
DESCRIBE makers;
• Select the toy database with use toy; and then run the command
SHOW TABLE makers;


Q4: By default, what directory in HDFS would store the data for a table
named characters in a database named muppets?

• /user/hive/warehouse/muppets.db/characters

• /muppets.db/characters

• /user/hive/warehouse/characters

• /muppets/characters

• /user/hive/warehouse/muppets/characters


Q5: By default, what directory in HDFS would store the data for a table
named incidents in the default database?

• /default.db/incidents

• /user/hive/warehouse/incidents

• /user/hive/warehouse/default/incidents

• /user/hive/warehouse/default.db/incidents

• /default/incidents


Q6: What delimiter is used to separate the values in the lines of the text file containing
the data in the crayons table in the wax database? (Hint: You need to go look at the file
in the VM.)

• comma (,)

• tab


Q7: Which command will print the contents of a file in HDFS to the terminal screen?

• hdfs dfs ls /path/to/file

• hdfs dfs -get /path/to/file

• hdfs dfs cat /path/to/file

• hdfs dfs get /path/to/file

• hdfs dfs -cat /path/to/file

• hdfs dfs -ls /path/to/file


• DDL with Hive: internal and external tables

Clauses:
- Row format delimited fields terminated by '<separator>'   II.1*
  (by default, the ASCII Control+A character)
- Stored as   II.2*
  (to choose a file format; the default is textfile)
- Location   II.3*
  - Forces the creation outside the warehouse
  - Allows creation over existing data outside the warehouse
  - Several external tables can share the same data
  - Without this clause, the table is created in the Hive warehouse
- IF NOT EXISTS → tests the existence of the table before creation
- LIKE → clones the schema of a table, but without the data   II.4*

Advanced options:   II.5*
• TBLPROPERTIES, for example:
  TBLPROPERTIES('skip.header.line.count'='1')
  alter table name SET TBLPROPERTIES('serialization.null.format'='...');
• Hive SerDes:   II.6* II.10*
  Row format serde 'org.apache.hadoop.hive.serde2....'
  - LazySimpleSerDe (the default)
  - OpenCSVSerde
  - RegexSerDe (separators inside fields)
  - JsonSerDe (missing values)
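Putting these clauses together, here is a minimal sketch of an external-table DDL; the table name, columns, and path are hypothetical, not taken from the exercises:

CREATE EXTERNAL TABLE IF NOT EXISTS ads (
    ad_id    INT,
    campaign STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dualcore/ads'
TBLPROPERTIES ('skip.header.line.count'='1');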


• Non-strict coupling: with Hive, you can define several external tables over the same file
(with different delimiters, for example)

• Describe Table II.7*


- Describe table_name;
- Describe formatted table_name; (more details concerning location and format)
- Show create table table_name; (creation details)

• Deleting databases and tables:

- Drop table [if exists] table_name;
  External table: only the metadata is deleted
  Internal table: the metadata, the data, and the directory are deleted (subject to access rights)

- Drop database [if exists] db_name; → if it contains tables: drop database db_name cascade;

• Modifications: alter table tab_name <action> <settings>
- Rename a table: alter table customers rename to clients; (internally, the folder is renamed too)
- Move a table to another database: alter table default.clients rename to dig.clients;
  (for an internal table the folder is moved; for an external table only the metadata is moved)
- Change a column's name and/or data type (we can do both at once):
  alter table employees change first_name fname string;
  alter table employees change salary salary bigint;
- Change the column order:
  alter table employees change salary salary int after office_id;
  alter table employees change salary salary int first;
  (changing the order does not change the data)
- Replace all columns: alter table tab_name replace columns(col1 type1, ...);
- Change an internal table to an external one:
  alter table tab_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   II.8*


Q1 A new database and table are created using the following statements. Where in HDFS
is the data for this table located?

CREATE DATABASE thisdb;
CREATE TABLE thisdb.thistable (col1 TINYINT, col2 INT);
• /user/hive/thistable
• /user/hive/warehouse/thisdb.db/thistable
• /user/warehouse/thistable
• /user/thisdb.db/thistable
• /user/hive/warehouse/thistable
• /user/thistable
• /user/warehouse/thisdb.db/thistable
• /user/hive/thisdb.db/thistable


Q2 Which statement or statements correctly set /dualcore/newtable/ as the directory that
holds the data for a table called newtable? Check all that apply.

• CREATE TABLE newtable (col1 STRING, col2 INT) '/dualcore/newtable/';

• CREATE TABLE newtable (col1 STRING, col2 INT) EXTERNAL '/dualcore/newtable/';

• CREATE EXTERNAL TABLE newtable (col1 STRING, col2 INT) '/dualcore/newtable/';

• CREATE TABLE newtable (col1 STRING, col2 INT) LOCATION '/dualcore/newtable/';

• CREATE EXTERNAL TABLE newtable (col1 STRING, col2 INT) LOCATION '/dualcore/newtable/';


Q3 Which statement correctly sets the file type for newtable as a comma-delimited text file?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT '.csv';

• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\c';

• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS CSV;

• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS TEXTFILE.CSV;

• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT CSV;


Q4 Which statement correctly uses a SerDe for reading and writing newtable's data?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;


Q5 Which statement shows correct syntax for skipping a header line in each file of a table's data?

• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TBLPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TABLEPROPERTIES
('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TBLPROPERTIES
('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TBLPROPERTIES ('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TABLEPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TABLEPROPERTIES
'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TBLPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TABLEPROPERTIES
('skip.header.line.count'='1');


Data Type


• Examining data types:
- Describe: columns and types
- Describe formatted: columns, types, and more details about the table
- Show create table: the full table definition (columns and data types)

• Out-of-range values:
- Decimal → NULL
- Float and double → -Infinity or Infinity
- Int → NULL
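A small sketch of these rules (the literal values are arbitrary, chosen only to overflow each type):

SELECT CAST('9999999999' AS INT);          -- beyond the INT range: NULL
SELECT CAST('123456.789' AS DECIMAL(5,2)); -- exceeds the declared precision: NULL
SELECT CAST('1e50' AS FLOAT);              -- beyond the FLOAT range: Infinity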

• File types: textfile, Avro, Parquet, ORC   II.9*

Textfile: excellent interoperability | low performance | human-readable | no compression | direct loading possible

Avro: binary serialization | supported by many tools* | excellent interoperability | excellent performance | not human-readable | integrated schema | direct loading not possible

Parquet: excellent interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible

ORC: low interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible

Sequence file: tied to Java and Hadoop; good performance but poor interoperability
RCFile: an earlier columnar format; low interoperability, low performance (the predecessor of ORC)
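As a sketch, the file type is chosen with the STORED AS clause seen earlier (the table and column names are hypothetical):

CREATE TABLE events_text    (id INT, msg STRING) STORED AS TEXTFILE;
CREATE TABLE events_avro    (id INT, msg STRING) STORED AS AVRO;     -- Hive 0.14+
CREATE TABLE events_parquet (id INT, msg STRING) STORED AS PARQUET;
CREATE TABLE events_orc     (id INT, msg STRING) STORED AS ORC;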


• Choice of format

Ingest pattern:
• Data arrives in blocks → Parquet (uses patterns to store efficiently)
• Data does not arrive in blocks → avoid Parquet

Need for interoperability → Parquet / textfile

Data lifetime:
• Good interoperability
• Preferably a format with an integrated, evolvable schema → Parquet / Avro

Data size → file compression matters → Parquet / ORC


Q1 Which of the following are integer types used by Hive? Check all
that apply.
• LARGEINT

• TINYINT

• BIGINT

• BYTEINT

• INT


Q2 Which of these data types has the greatest range of possible values?
(Note that this question is asking about range, not precision.)

• DOUBLE

• FLOAT

• DECIMAL


Q3 The International Standard ISO-3166-1993 provides a unique two-letter code for each
country on its list. For example, Japan is JP, India is IN, and Indonesia is ID. Each of the
following string types could be used for a column that holds values from this list. Which
is best to use?

• CHAR(2)

• VARCHAR(2)

• STRING


Q4 Suppose you have data that looks similar to the following: an ID code (always containing
digits with a non-zero leading digit), a name, an order code (could contain letters or digits),
and a total amount in US dollars and cents (seven digits or less).
83928,Carey Myagi,C83820K,28.09
Which of the following would be the best choice of data types for these four columns (in
the same order)?
• TINYINT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, FLOAT
• TINYINT, STRING, STRING, DECIMAL(7,2)
• INT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, FLOAT
• INT, STRING, STRING, DECIMAL(7,2)


Q5 Which are valid methods to find the data types of a column?

• Use the data source (left assist) panel in Hue to see the table schema
• Use DESCRIBE SCHEMA table_name; to see the table schema
• Use the Table Browser in Hue to see the table schema
• Use SHOW CREATE TABLE table_name; to see the table schema
• Use DESCRIBE table_name; to see the table schema
• Use SHOW TYPE column_name FROM table_name; to get the type of column_name
• Use SELECT typeof(column_name) FROM table_name; to get the type of column_name
(Impala only)


Q6 Which are advantages of using Text Files?

• Columnar format (which improves performance for some access patterns)

• Efficient storage

• Good interoperability (used by many applications)

• Easily read by humans


Q7 Which are important considerations when choosing a file type for storing data?
• Data size and query performance

• Names of the columns in the data

• On-premises or cloud storage

• Ingest pattern

• Interoperability needs

• Loading data with the command line
We have already seen: hdfs dfs -ls, hdfs dfs -cat, hdfs dfs -get (HDFS to local)
Exercise:
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/ancient_games.csv
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -cp /old/games/ancient_games.csv /user/hive/warehouse/fun.db/games/
(omit the target name if you want to keep the same name)
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/*   (to delete everything in the directory)
hdfs dfs -rm -r /path   (to delete a folder)
hdfs dfs -mv /path… /path…   (to move)
hdfs dfs -put /home/training/training_materials/analyst/data/games.csv /user/hive/warehouse/fun.db/games/
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -mkdir /path/dir   (-p to create the parent folder(s))

• Loading data with Sqoop (RDBMS to Hadoop, and the opposite)

• All tables from a database to HDFS
• One table to HDFS
• Part of a table to HDFS
• HDFS data to an RDBMS

Example: sqoop import --connect jdbc:mysql://localhost/company --username training --password training --table customers

• The customers folder will be created in the user's HDFS home directory
• The default format is textfile
• The default separator is ','
• Each record becomes one line
• A connector other than JDBC can be used
• If there is no primary key, use --split-by column (string columns possible from Sqoop 1.4.6*) or -m 1
• Use import-all-tables to load all the tables of a database

II.11* * you must pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true just after import
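For example, a sketch of importing a table that has no primary key, splitting on a string column (the logs table and its log_code column are hypothetical):

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://localhost/company \
  --username training --password training \
  --table logs --split-by log_code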

• Loading data with Hive: two methods, LOAD DATA and INSERT

Advantages:
• Easier loading with SQL
• Data processing is possible during loading (with INSERT)

I/ LOAD
Load data inpath reads from HDFS (or S3):
Load data inpath '/path/dir/file.ext' into table name;   (use '/path/dir/' to move all the contents of dir)
Files are added to those already present; in case of a name collision, the new files are renamed.
To overwrite what exists, use overwrite:
Load data inpath '/path/dir/file.ext' overwrite into table name;

II/ INSERT
Insert into table table_name values(value1, value2, ...), (value_n, value_n+1, ...);
→ Adds files without deleting the old ones
→ Adding small fragments of data (<64 MB) creates many small files, which makes processing more cumbersome
Insert overwrite table table_name select * from tablex;   (replaces all the existing data)
Insert into table table_name select * from tablex;   (appending is also possible)

III/ CTAS: Create Table As Select
Create table chicago_emp row format delimited fields terminated by ','   (this row format clause is optional)
As select * from employees where office_id='b';


Q1 Which hdfs dfs subcommand can be used to place a file that's already in
HDFS into a different directory, without also leaving it in the original
directory?

• -cp
• -get
• -ls
• -mv
• -put
• -rm


Q2 What is the default setting for serialization.null.format?

• Empty string ('')

• Space character (' ')

• NULL

• \N


Q3
This command uses Sqoop to import data from MySQL to HDFS on the
VM (with the user root).
$ sqoop import --connect jdbc:mysql://localhost/mydb \
--username root --password hadoop \
--table example_table
Where in HDFS will the data be found?

• /user/hive/warehouse/mydb.db/example_table
• /user/training/example_table
• /user/training/mydb.db/example_table
• /user/hive/warehouse/example_table


Q4 Which is a correct statement about Sqoop?

• You can use one command to import all tables in a database to HDFS

• You can only import one table at a time to HDFS


Q5 Which are valid commands for loading data into tables using Hive?
• LOAD DATA INPATH '/path/to/file.ext' INTO TABLE table;

• LOAD DATA '/path/to/file.ext' OVERWRITE INTO TABLE table;

• LOAD DATA INPATH '/path/to/file.ext' OVERWRITE INTO TABLE table;

• LOAD DATA '/path/to/file.ext' INTO TABLE table;

• LOAD DATA '/path/to/file.ext' OVERWRITE TABLE table;

• LOAD DATA INPATH '/path/to/file.ext' OVERWRITE TABLE table;


Q6 Which best describes the issue of the “small files problem”?

• Small files are ignored by the big data systems

• Small files can't be compressed much, so they take too much space to
store

• Small files slow the system down more than (fewer) big files holding the
same amount of data

• Small files don't have enough data for the big data processes to work
with


Q7 It's possible to create a table and then populate it with data from
another table using INSERT...SELECT. You can also perform these two
steps with a single command. What is that technique called?

• Simple cloning
• CTAS
• Modified cloning
• CREATE INSERT
• CREATE SELECT
• CTIS


• Views
• Simplify access to tables for frequent queries
• Restrict access to certain columns (security)
→ Not materialized (Hive supports materialized views from 3.0)
→ No data is deleted when a view is deleted

Create view view_name as select ... from table_name;
Select ... from view_name;

Materialization:   II.14*
• Faster (less use of SQL engine resources)
• Non-automatic updates
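A minimal sketch, reusing the orders table from the earlier quizzes (the view names and the filter are hypothetical):

Create view big_orders as
Select order_id, cust_id, total from orders where total > 1000;

Select * from big_orders;   -- queried like an ordinary table

-- Hive 3.0 and later only:
Create materialized view order_totals as
Select cust_id, sum(total) as sum_total from orders group by cust_id;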


• Partitioning: when to use it
* The data is very large and queries are too slow
* The data is already stored in folders according to the partition column (without the column appearing in the data itself)
* Most queries filter on one or two columns

→ Static partitioning:
• When the data grows according to the partition column(s)
• When the data to load is already split into files according to the partition column

→ Dynamic partitioning: automatic partitioning when loading (dangerous*)

→ A partition is a folder containing part of the data
→ A good partitioning column should preferably be categorical, with a reasonable number of categories

* SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;   II.13*
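A minimal sketch of both modes, assuming a hypothetical orders_part table and a source table orders with a country column:

Create table orders_part (order_id int, total decimal(7,2))
Partitioned by (country string);

-- static: the partition value is given explicitly
Insert into table orders_part partition (country='FR')
Select order_id, total from orders where country='FR';

-- dynamic: the partitions are created from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Insert into table orders_part partition (country)
Select order_id, total, country from orders;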


• Distributed storage engines vs. distributed file systems

Distributed file systems (such as HDFS):
• The most used with Hive and other distributed SQL engines
• The data is stored in files that you can list and access directly

Distributed storage engines:
• Abstract away the details of data storage and file formats
• Encapsulate the data behind a high-level manipulation interface


• Complex types (multiple values in a field):

Array: all elements have the same type
Map: all keys have the same type, and all values have the same type; keys and values can be of different types
Struct: the fields can be of different types
→ Combinations are possible
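A minimal sketch combining the three types (the table and its columns are hypothetical):

Create table contacts (
  name    string,
  phones  array<string>,
  ages    map<string, int>,
  address struct<street:string, city:string>
);

Select phones[0], ages['bob'], address.city from contacts;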


Q1 In the file system, how is a table partition represented?

• As a distinct HDFS host

• As a file within the table’s storage directory

• As a distinct logical disk on the physical hard disk

• As a subdirectory within the table’s storage directory


Q2 Which complex data type represents an ordered list of values, all having the same data type?

• ARRAY

• STRUCT

• MAP


Q3 Which of the following statements accurately describe distributed file systems like HDFS?
Check all that apply.

• They encapsulate data storage, exposing a high-level interface to the data

• They store data in files that you can list and directly access

• They abstract away the details of how the data is stored in files using specific file formats

• They are the systems most often used to store data for Hive
