CDOSS Certificate
Managing Big Data in Hadoop Cluster
© Dr Heni Bouhamed
Big Data Trainer
Senior Lecturer at Sfax University
Senior Lecturer at ESTYA University (France)
Cloudera Instructor at Elitech Paris
Certified CDOSS Big Data Instructor
Heni.bouhamed@fsegs.usf.tn


• Browsing files (command line):

hdfs dfs -ls /path                             (list a directory)
hdfs dfs -cat /path                            (print a file's contents)
hdfs dfs -get /path/to/hdfs /path/to/local     (copy from HDFS to the local file system)


Q1: Use the hive shell or beeline to view the tables in the default
database. Which of the following tables are in the default database?
Check all that apply.
• crayons
• customers
• employees
• games
• inventory
• makers
• offices
• orders
• salary_grades

Q2: Use the hive shell or beeline to view the columns in the orders table in the default
database. Which of the following columns are in the orders table? Check all that apply.

• Country

• cust_id

• empl_id

• name

• office_id

• order_id

• total


Q3: Once you're in the hive shell or beeline, one way to get the schema of
the makers table is to:

• Select the toy database with use toy; and then run the command
SHOW DATABASE;
• Select the toy database with use toy; and then run the command
DESCRIBE TABLES;
• Select the toy database with use toy; and then run the command
SHOW TABLES;
• Select the toy database with use toy; and then run the command
DESCRIBE makers;
• Select the toy database with use toy; and then run the command
SHOW TABLE makers;


Q4: By default, what directory in HDFS would store the data for a table
named characters in a database named muppets?

• /user/hive/warehouse/muppets.db/characters

• /muppets.db/characters

• /user/hive/warehouse/characters

• /muppets/characters

• /user/hive/warehouse/muppets/characters


Q5: By default, what directory in HDFS would store the data for a table
named incidents in the default database?

• /default.db/incidents

• /user/hive/warehouse/incidents

• /user/hive/warehouse/default/incidents

• /user/hive/warehouse/default.db/incidents

• /default/incidents


Q6: What delimiter is used to separate the values in the lines of the text file containing
the data in the crayons table in the wax database? (Hint: You need to go look at the file
in the VM.)

• comma (,)

• tab


Q7: Which command will print the contents of a file in HDFS to the terminal screen?

• hdfs dfs ls /path/to/file

• hdfs dfs -get /path/to/file

• hdfs dfs cat /path/to/file

• hdfs dfs get /path/to/file

• hdfs dfs -cat /path/to/file

• hdfs dfs -ls /path/to/file


• DDL with Hive: internal and external tables

Clauses:
- Row format delimited fields terminated by '<separator>'   II.1*
  (by default, the ASCII Control+A character)
- Stored as   II.2*
  (to choose a file format; the default is textfile)
- Location   II.3*
  - Forces the creation outside the warehouse
  - Allows creation over existing data outside the warehouse
  - Several external tables can share the same data
  - Without this clause, the table is created in the Hive warehouse
- IF NOT EXISTS → tests the existence of the table before creation
- LIKE → clones the schema of a table, but without the data   II.4*

Advanced options:   II.5*
• TBLPROPERTIES, for example:
  TBLPROPERTIES('skip.header.line.count'='1')
  alter table name SET TBLPROPERTIES('serialization.null.format'='...');
• Hive SerDes:   II.6* II.10*
  Row format serde 'org.apache.hadoop.hive.serde2....'
  - LazySimpleSerDe (the default)
  - OpenCSVSerde
  - RegexSerDe (separators inside fields)
  - JsonSerDe (missing values)
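Putting these clauses together, here is a minimal sketch of an external-table DDL; the table name, columns, and path are hypothetical, not taken from the exercises:

CREATE EXTERNAL TABLE IF NOT EXISTS ads (
    ad_id    INT,
    campaign STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dualcore/ads'
TBLPROPERTIES ('skip.header.line.count'='1');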


• Non-strict coupling: with Hive, you can define several external tables over the same file
(with different delimiters, for example)

• Describe Table II.7*


- Describe table_name;
- Describe formatted table_name; (more details concerning location and format)
- Show create table table_name; (creation details)

• Deleting databases and tables:

- Drop table [if exists] table_name;
  External table: only the metadata is deleted
  Internal table: the metadata, the data, and the directory are deleted (subject to access rights)

- Drop database [if exists] db_name; → if it contains tables: drop database db_name cascade;

• Modifications: alter table tab_name <action> <settings>
- Rename a table: alter table customers rename to clients; (internally, the folder is renamed too)
- Move a table to another database: alter table default.clients rename to dig.clients;
  (for an internal table the folder is moved; for an external table only the metadata is moved)
- Change a column's name and/or data type (we can do both at once):
  alter table employees change first_name fname string;
  alter table employees change salary salary bigint;
- Change the column order:
  alter table employees change salary salary int after office_id;
  alter table employees change salary salary int first;
  (changing the order does not change the data)
- Replace all columns: alter table tab_name replace columns(col1 type1, ...);
- Change an internal table to an external one:
  alter table tab_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   II.8*


Q1 A new database and table are created using the following statements. Where in HDFS
is the data for this table located?

CREATE DATABASE thisdb;
CREATE TABLE thisdb.thistable (col1 TINYINT, col2 INT);
• /user/hive/thistable
• /user/hive/warehouse/thisdb.db/thistable
• /user/warehouse/thistable
• /user/thisdb.db/thistable
• /user/hive/warehouse/thistable
• /user/thistable
• /user/warehouse/thisdb.db/thistable
• /user/hive/thisdb.db/thistable


Q2 Which statement or statements correctly set /dualcore/newtable/ as the directory that
holds the data for a table called newtable? Check all that apply.

• CREATE TABLE newtable (col1 STRING, col2 INT) '/dualcore/newtable/';

• CREATE TABLE newtable (col1 STRING, col2 INT) EXTERNAL '/dualcore/newtable/';

• CREATE EXTERNAL TABLE newtable (col1 STRING, col2 INT) '/dualcore/newtable/';

• CREATE TABLE newtable (col1 STRING, col2 INT) LOCATION '/dualcore/newtable/';

• CREATE EXTERNAL TABLE newtable (col1 STRING, col2 INT) LOCATION '/dualcore/newtable/';


Q3 Which statement correctly sets the file type for newtable as a comma-delimited text file?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT '.csv';

• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\c';

• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS CSV;

• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS TEXTFILE.CSV;

• CREATE TABLE newtable (col1 STRING, col2 INT) FILE FORMAT CSV;


Q4 Which statement correctly uses a SerDe for reading and writing newtable's data?
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;
• CREATE TABLE newtable (col1 STRING, col2 INT) STORED AS SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) WITH SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde';
• CREATE TABLE newtable (col1 STRING, col2 INT) ROW FORMAT SERDE
org.apache.hadoop.hive.serde2.OpenCSVSerde;


Q5 Which statement shows correct syntax for skipping a header line in each file of a table's data?

• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TBLPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TABLEPROPERTIES
('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TBLPROPERTIES
('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TBLPROPERTIES ('skip.header.line.count'='1');
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TABLEPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TABLEPROPERTIES
'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) TBLPROPERTIES 'skip.header.line.count'='1';
• CREATE TABLE table_with_header (col1 INT, col2 STRING) SET TABLEPROPERTIES
('skip.header.line.count'='1');


Data Type


• Examining data types:
- Describe: columns and types
- Describe formatted: columns, types, and more details about the table
- Show create table: the full table definition (columns and data types)

• Out-of-range values:
- Decimal → NULL
- Float and double → -Infinity or Infinity
- Int → NULL
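A small sketch of these rules (the literal values are arbitrary, chosen only to overflow each type):

SELECT CAST('9999999999' AS INT);          -- beyond the INT range: NULL
SELECT CAST('123456.789' AS DECIMAL(5,2)); -- exceeds the declared precision: NULL
SELECT CAST('1e50' AS FLOAT);              -- beyond the FLOAT range: Infinity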

• File types: textfile, Avro, Parquet, ORC   II.9*

Textfile: excellent interoperability | low performance | human-readable | no compression | direct loading possible

Avro: binary serialization | supported by many tools* | excellent interoperability | excellent performance | not human-readable | integrated schema | direct loading not possible

Parquet: excellent interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible

ORC: low interoperability | excellent performance | column-oriented | not human-readable | compression possible | direct loading not possible

Sequence file: tied to Java and Hadoop; good performance but poor interoperability
RCFile: an earlier columnar format; low interoperability, low performance (the predecessor of ORC)
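As a sketch, the file type is chosen with the STORED AS clause seen earlier (the table and column names are hypothetical):

CREATE TABLE events_text    (id INT, msg STRING) STORED AS TEXTFILE;
CREATE TABLE events_avro    (id INT, msg STRING) STORED AS AVRO;     -- Hive 0.14+
CREATE TABLE events_parquet (id INT, msg STRING) STORED AS PARQUET;
CREATE TABLE events_orc     (id INT, msg STRING) STORED AS ORC;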


• Choice of format

Ingest pattern:
• Data arrives in blocks → Parquet (uses patterns to store efficiently)
• Data does not arrive in blocks → avoid Parquet

Need for interoperability → Parquet / textfile

Data lifetime:
• Good interoperability
• Preferably a format with an integrated, evolvable schema → Parquet / Avro

Data size → file compression matters → Parquet / ORC


Q1 Which of the following are integer types used by Hive? Check all
that apply.
• LARGEINT

• TINYINT

• BIGINT

• BYTEINT

• INT


Q2 Which of these data types has the greatest range of possible values?
(Note that this question is asking about range, not precision.)

• DOUBLE

• FLOAT

• DECIMAL


Q3 The International Standard ISO-3166-1993 provides a unique two-letter code for each
country on its list. For example, Japan is JP, India is IN, and Indonesia is ID. Each of the
following string types could be used for a column that holds values from this list. Which
is best to use?

• CHAR(2)

• VARCHAR(2)

• STRING


Q4 Suppose you have data that looks similar to the following: an ID code (always containing
digits with a non-zero leading digit), a name, an order code (could contain letters or digits),
and a total amount in US dollars and cents (seven digits or less).
83928,Carey Myagi,C83820K,28.09
Which of the following would be the best choice of data types for these four columns (in
the same order)?
• TINYINT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, FLOAT
• TINYINT, STRING, STRING, DECIMAL(7,2)
• INT, STRING, STRING, FLOAT
• TINYINT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, DECIMAL(7,2)
• INT, STRING, INT, FLOAT
• INT, STRING, STRING, DECIMAL(7,2)


Q5 Which are valid methods to find the data types of a column?

• Use the data source (left assist) panel in Hue to see the table schema
• Use DESCRIBE SCHEMA table_name; to see the table schema
• Use the Table Browser in Hue to see the table schema
• Use SHOW CREATE TABLE table_name; to see the table schema
• Use DESCRIBE table_name; to see the table schema
• Use SHOW TYPE column_name FROM table_name; to get the type of column_name
• Use SELECT typeof(column_name) FROM table_name; to get the type of column_name
(Impala only)


Q6 Which are advantages of using Text Files?

• Columnar format (which improves performance for some access patterns)

• Efficient storage

• Good interoperability (used by many applications)

• Easily read by humans


Q7 Which are important considerations when choosing a file type for storing data?
• Data size and query performance

• Names of the columns in the data

• On-premises or cloud storage

• Ingest pattern

• Interoperability needs

• Loading data with the command line
We have already seen: hdfs dfs -ls, hdfs dfs -cat, hdfs dfs -get (HDFS to local)
Exercise:
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/ancient_games.csv
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -cp /old/games/ancient_games.csv /user/hive/warehouse/fun.db/games/
(omit the target name if you want to keep the same name)
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -rm /user/hive/warehouse/fun.db/games/*   (to delete everything in the directory)
hdfs dfs -rm -r /path   (to delete a folder)
hdfs dfs -mv /path… /path…   (to move)
hdfs dfs -put /home/training/training_materials/analyst/data/games.csv /user/hive/warehouse/fun.db/games/
hdfs dfs -ls /user/hive/warehouse/fun.db/games/
hdfs dfs -mkdir /path/dir   (-p to create the parent folder(s))

• Loading data with Sqoop (RDBMS to Hadoop, and the opposite)

• All tables from a database to HDFS
• One table to HDFS
• Part of a table to HDFS
• HDFS data to an RDBMS

Example: sqoop import --connect jdbc:mysql://localhost/company --username training --password training --table customers

• The customers folder will be created in the user's HDFS home directory
• The default format is textfile
• The default separator is ','
• Each record becomes one line
• A connector other than JDBC can be used
• If there is no primary key, use --split-by column (string columns possible from Sqoop 1.4.6*) or -m 1
• Use import-all-tables to load all the tables of a database

II.11* * you must pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true just after import
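For example, a sketch of importing a table that has no primary key, splitting on a string column (the logs table and its log_code column are hypothetical):

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://localhost/company \
  --username training --password training \
  --table logs --split-by log_code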

• Loading data with Hive: two methods, LOAD DATA and INSERT

Advantages:
• Easier loading with SQL
• Data processing is possible during loading (with INSERT)

I/ LOAD
Load data inpath reads from HDFS (or S3):
Load data inpath '/path/dir/file.ext' into table name;   (use '/path/dir/' to move all the contents of dir)
Files are added to those already present; in case of a name collision, the new files are renamed.
To overwrite what exists, use overwrite:
Load data inpath '/path/dir/file.ext' overwrite into table name;

II/ INSERT
Insert into table table_name values(value1, value2, ...), (value_n, value_n+1, ...);
→ Adds files without deleting the old ones
→ Adding small fragments of data (<64 MB) creates many small files, which makes processing more cumbersome
Insert overwrite table table_name select * from tablex;   (replaces all the existing data)
Insert into table table_name select * from tablex;   (appending is also possible)

III/ CTAS: Create Table As Select
Create table chicago_emp row format delimited fields terminated by ','   (this row format clause is optional)
As select * from employees where office_id='b';


Q1 Which hdfs dfs subcommand can be used to place a file that's already in
HDFS into a different directory, without also leaving it in the original
directory?

• -cp
• -get
• -ls
• -mv
• -put
• -rm


Q2 What is the default setting for serialization.null.format?

• Empty string ('')

• Space character (' ')

• NULL

• \N


Q3
This command uses Sqoop to import data from MySQL to HDFS on the
VM (with the user root).
$ sqoop import --connect jdbc:mysql://localhost/mydb \
--username root --password hadoop \
--table example_table
Where in HDFS will the data be found?

• /user/hive/warehouse/mydb.db/example_table
• /user/training/example_table
• /user/training/mydb.db/example_table
• /user/hive/warehouse/example_table


Q4 Which is a correct statement about Sqoop?

• You can use one command to import all tables in a database to HDFS

• You can only import one table at a time to HDFS


Q5 Which are valid commands for loading data into tables using Hive?
• LOAD DATA INPATH '/path/to/file.ext' INTO TABLE table;

• LOAD DATA '/path/to/file.ext' OVERWRITE INTO TABLE table;

• LOAD DATA INPATH '/path/to/file.ext' OVERWRITE INTO TABLE table;

• LOAD DATA '/path/to/file.ext' INTO TABLE table;

• LOAD DATA '/path/to/file.ext' OVERWRITE TABLE table;

• LOAD DATA INPATH '/path/to/file.ext' OVERWRITE TABLE table;


Q6 Which best describes the issue of the “small files problem”?

• Small files are ignored by the big data systems

• Small files can't be compressed much, so they take too much space to
store

• Small files slow the system down more than (fewer) big files holding the
same amount of data

• Small files don't have enough data for the big data processes to work
with


Q7 It's possible to create a table and then populate it with data from
another table using INSERT...SELECT. You can also perform these two
steps with a single command. What is that technique called?

• Simple cloning
• CTAS
• Modified cloning
• CREATE INSERT
• CREATE SELECT
• CTIS


• Views
• Simplify access to tables for frequent queries
• Restrict access to certain columns (security)
→ Not materialized (Hive supports materialized views from 3.0)
→ No data is deleted when a view is deleted

Create view view_name as select ... from table_name;
Select ... from view_name;

Materialization:   II.14*
• Faster (less use of SQL engine resources)
• Non-automatic updates
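A minimal sketch, reusing the orders table from the earlier quizzes (the view names and the filter are hypothetical):

Create view big_orders as
Select order_id, cust_id, total from orders where total > 1000;

Select * from big_orders;   -- queried like an ordinary table

-- Hive 3.0 and later only:
Create materialized view order_totals as
Select cust_id, sum(total) as sum_total from orders group by cust_id;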


• Partitioning: when to use it
* The data is very large and queries are too slow
* The data is already stored in folders according to the partition column (without the column appearing in the data itself)
* Most queries filter on one or two columns

→ Static partitioning:
• When the data grows according to the partition column(s)
• When the data to load is already split into files according to the partition column

→ Dynamic partitioning: automatic partitioning when loading (dangerous*)

→ A partition is a folder containing part of the data
→ A good partitioning column should preferably be categorical, with a reasonable number of categories

* SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;   II.13*
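A minimal sketch of both modes, assuming a hypothetical orders_part table and a source table orders with a country column:

Create table orders_part (order_id int, total decimal(7,2))
Partitioned by (country string);

-- static: the partition value is given explicitly
Insert into table orders_part partition (country='FR')
Select order_id, total from orders where country='FR';

-- dynamic: the partitions are created from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Insert into table orders_part partition (country)
Select order_id, total, country from orders;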


• Distributed storage engines vs. distributed file systems

Distributed file systems (such as HDFS):
• The most used with Hive and other distributed SQL engines
• The data is stored in files that you can list and access directly

Distributed storage engines:
• Abstract away the details of data storage and file formats
• Encapsulate the data behind a high-level manipulation interface


• Complex types (multiple values in a field):

Array: all elements have the same type
Map: all keys have the same type, and all values have the same type; keys and values can be of different types
Struct: the fields can be of different types
→ Combinations are possible
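A minimal sketch combining the three types (the table and its columns are hypothetical):

Create table contacts (
  name    string,
  phones  array<string>,
  ages    map<string, int>,
  address struct<street:string, city:string>
);

Select phones[0], ages['bob'], address.city from contacts;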


Q1 In the file system, how is a table partition represented?

• As a distinct HDFS host

• As a file within the table’s storage directory

• As a distinct logical disk on the physical hard disk

• As a subdirectory within the table’s storage directory


Q2 Which complex data type represents an ordered list of values, all having the same data type?

• ARRAY

• STRUCT

• MAP


Q3 Which of the following statements accurately describe distributed file systems like HDFS?
Check all that apply.

• They encapsulate data storage, exposing a high-level interface to the data

• They store data in files that you can list and directly access

• They abstract away the details of how the data is stored in files using specific file formats

• They are the systems most often used to store data for Hive
