
BDA

1) The MapReduce framework of Hadoop also takes care of _______.


a) Scheduling b) Monitoring c) Re-executing failed task d) All

2) The term NoSQL was first coined by _______.


a) Doug Laney b) Carlo Strozzi c) Brewer d) Gartner

* 3) Structured, unstructured and semi-structured data relate to
which of the following characteristics?
a) Velocity b) Volatility c) Variability d) Volume

* 4) In which of the following is the analysis descriptive,
predictive and prescriptive?
a) Analytics 1.0 b) Analytics 2.0 c) Analytics 3.0 d) None

5) When NameNode starts up, it reads the ______ and ______ from
disk.
a) TaskTracker, JobTracker b) FsImage, EditLog
c) Master Node, Slave Node d) None of the above.

6) Which of the following is a tool to transfer data between Hadoop


and Relational Databases?
a) Sqoop b) HBase c) Hive d) Pig

7) Which of the following is/are advantages of Hadoop?


a) Scalable b) Cost Effective c) Fault-tolerant d) All the above
8) Core MongoDB operations are ________.
a) create, select, update, delete b) create, read, update, delete
c) create, read, update, drop d) create, remove, update, drop
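
A minimal hedged sketch of the CRUD cycle in the legacy mongo shell (the books collection and its fields are illustrative, not part of the question):

db.books.insert({ title: "BDA", qty: 10 })                 // Create
db.books.find({ title: "BDA" })                            // Read
db.books.update({ title: "BDA" }, { $set: { qty: 12 } })   // Update
db.books.remove({ title: "BDA" })                          // Delete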

9) Cassandra is a ________ database.


a) Document-oriented b) Graph-oriented c) Column-oriented d) SQL

10) A List is a collection of _________.


a) Unordered elements b) Ordered elements
c) Paired elements d) Only images

11) PIG is ________.


a) dataflow language b) NoSQL database
c) import export tool d) scheduling engine

12) Hive provides _________ kinds of partitions.


a) Static b) Dynamic
c) Both Static and dynamic d) Neither static nor dynamic

13) ______ is Data warehousing tool.


a) Jaspersoft studio b) Cassandra c) Pig d) Hive

14) ETL processing in Pig stand for _________.


a) Extract, transform and load b) Extend transfer and load
c) Extract, transform and local d) None of the above
15) _______ used to transmit data between web server and web
application.
a) XML b) JSON c) Both a) and b) d) None

16) MS-Excel files are under the category of _________ data.


a) Structured data b) Semi-structured data c) Unstructured data
d) None

17) Real time processing deals with __________ characteristic of


data.
a) Variety b) Velocity c) Variability d) Volume

18) The 3V’s term of big data was first introduced by


a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

19) _________ has no support for ACID properties.


a) SQL b) MySQL c) NewSQL d) NoSQL

20) A coordinated processing of a program by multiple processors,
each working on a different part of the program and using its own
operating system and memory, is called
a) In memory analysis b) Distributed system
c) Massively parallel processing d) Shared disk

* 21) The Node Manager is responsible for launching application ________.
a) Resources b) Job c) Containers d) None

22) Which of the following formats is supported by MongoDB ?
a) XML b) SQL c) BSON d) All of the mentioned
23) Core MongoDB operations are
a) Create, Select, Update, Delete b) Create, Read, Update, Delete
c) Create, Read, Update, Drop d) None

* 24) Which of the following is the correct command to update a
document ?
a) db.books.update({item:"book", qty:{$gt:7}}, {$set:{x:5}, $inc:{y:8}})
b) db.books.find().update({item:"book", qty:{$gt:7}}, {$set:{x:5}, $inc:{y:8}})
c) db.books.update({item:"book", qty:{$gt:7}}, {$set:{x:5}, $inc:{y:8}}, {multi:true})
d) db.books.find().update({item:"book", qty:{$gt:7}}, {$set:{x:5}, $inc:{y:8}}, {multi:true})
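
A hedged sketch of the update pattern these options are testing, using the legacy mongo shell API (the books collection and the fields item, qty, x, y come from the question):

db.books.update(
    { item: "book", qty: { $gt: 7 } },    // query: which documents to match
    { $set: { x: 5 }, $inc: { y: 8 } },   // set x to 5 and increment y by 8
    { multi: true }                       // apply to all matches, not just the first
)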

25) Which of the following is a wide column store ?


a) MongoDB b) Cassandra c) Riak d) Redis

26) MetaStore contains _________ of Hive tables.


a) System catalog b) Drivers c) Database d) CLI

27) The interactive mode of Pig is


a) Pig Engine b) Grunt c) ETL d) Pig Latin

28) In Cassandra, __________ is called the peer-to-peer communication
protocol used for intra-ring communication.
a) Anti-Entropy b) Gossip protocol c) Hinted Handoffs d) All
29) CCTV footage is which type of data ?
a) Structured data b) Unstructured data c) Semi-structured data d) All
of these

* 30) Big volume of data like 1 Yottabyte is equal to
a) 1024^4 bytes b) 1024^6 bytes c) 1024^8 bytes d) 1024^9 bytes

31) A system that continues to function even when a network partition
occurs is said to have
a) Partition tolerance b) Consistency c) Availability d) None

* 32) ___________ is a robust database that supports the ACID
properties of transactions and has the scalability of NoSQL.
a) SQL b) NewSQL c) MySQL d) NoSQL

33) A typical block size used by HDFS is
a) 64 MB b) 128 MB c) 32 MB d) 256 MB

34) _____________ is the book-keeper of HDFS.


a) Name node b) Data node c) Job Tracker d) Task tracker

35) NameNode uses ______________ to record every transaction.
a) FsImage b) EditLog c) Data node d) Map Reduce

36) MongoDB has been adopted as _____________ software by a
number of major websites and services.
a) Frontend b) Backend c) Proprietary d) All of the mentioned
37) Which one of the following is equivalent to: Select * from
employee order by salary desc;
a) db.employee.find().sort({“salary”:1}) b) db.employee.find().sort({“salary”:-1})
c) db.employee.sort({“salary”:1}) d) db.employee.sort({“salary”:-1})
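
In the mongo shell, the sort direction is given per field: 1 is ascending and -1 is descending. A brief hedged sketch (the employee collection name comes from the question):

db.employee.find().sort({ salary: -1 })   // descending, like ORDER BY salary DESC
db.employee.find().sort({ salary: 1 })    // ascending
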
38) Hive is a ______________ tool.
a) Data Flow b) Data Warehouse c) Import Export Tool d) Data Transfer

39) Cassandra is a column-oriented database designed to support a
__________ symmetric node architecture.
a) Peer to Peer b) Master Slave c) Both a) and b) d) None

40) The 3 types of collections used in Cassandra are
a) Array, Set, List b) Set, List, Struct
c) Set, Map, Array d) Set, List, Map

41) MetaStore consists of _____________ and a ____________
a) Metaservices, database b) Metatable, WebUI
c) Metaservices, drivers d) CLI, Server

* 42) Pig is used in the ____________ process.
a) ETL b) Scripting c) Database d) None

43) E-mails are under the category of ___________ data.
a) Structured data b) Semi-structured data
c) Unstructured data d) None of above
44) ______ is used to transmit data between a web server and a web
application.
a) XML b) JSON c) Both a and b d) None

45) MongoDB is based on the ________ and _________ properties of the
CAP theorem.
a) consistency and availability b) availability and partition tolerance
c) consistency and partition tolerance d) all of the above

* 46) Big volume of data like 1 Zettabyte is equal to
a) 1024^4 Bytes b) 1024^6 Bytes c) 1024^5 Bytes d) 1024^7 Bytes

47) The CAP theorem is also called the ________ theorem.
a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

*48) Hadoop supports ________ data formats.
a) structured b) semi-structured c) unstructured d) all the above

49) MongoDB is
a) RDBMS b) Document-oriented DBMS
c) Object Oriented DBMS d) Key-value store

50) Which command in MongoDB is equivalent to SQL select ?


a) find() b) search() c) document() d) none of above
* 51) What does the following command do ?
db.sample.find().limit(10)
a) Show 10 documents randomly from the collection sample
b) Show only first 10 documents from the collection sample
c) Repeats the first document 10 times
d) None of above

52) In Cassandra, _________ is called the peer-to-peer communication


protocol used for intra-ring communication.
a) Anti-entropy b) Hinted Handoffs c) Gossip protocol d) None
of above

53) Databases are under the category of ______ data.


a) Structured data b) Semi-structured data
c) Unstructured data d) None of above

54) What was Hadoop named after ?


a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop’s development

55) A ____ serves as the master and there is only one NameNode
per cluster.
a) Data Node b) NameNode c) Data block d) Replication

56) _____ NameNode is used when the Primary NameNode goes down.
a) Rack b) Data c) Secondary d) None of these

57) What is the default HDFS replication factor ?


a) 4 b) 1 c) 3 d) 2

*58) Which one is not one of the big data features ?
a) Velocity b) Veracity c) Volume d) Variety

59) NameNode in HDFS uses ___ to store the file system name space.
a) EditLog b) FsImage c) Data node d) Map reduce


60) How many blocks will be created for a file that is 300 MB ?
The default block size is 64 MB and the replication factor is 3.
(Working: ceil(300/64) = 5 logical blocks; with 3 replicas, 5 × 3 = 15 stored blocks.)
a) 30 b) 15 c) 5 d) 100

61) What does the Job Tracker do ?
a) Stores blocks of data b) Stores metadata
c) Coordinates and schedules the job d) Acts as a mini reducer

62) HDFS is based on
a) Facebook file system b) Google file system
c) IBM file system d) Yahoo file system

63) The Hadoop framework is written in
a) C++ b) Python c) Java d) C

64) Apache Cassandra was born at
a) Google b) Facebook c) IBM d) Yahoo

65)Which of the following is a valid flow in Hadoop ?

a. Input -> Reducer -> Mapper -> Combiner -> Output
b. Input -> Mapper -> Reducer -> Combiner -> Output
c. Input -> Mapper -> Combiner -> Reducer -> Output
d. Input -> Reducer -> Combiner -> Mapper -> Output
66) Hive is used as
a) Data Flow Language b) Data Warehousing Language
c) Workflow Language d) Scheduling Language

67) MapReduce was devised by
a) Apple b) Google c) Microsoft d) Samsung

68) Hadoop was first introduced by
a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

69) On a single Hadoop cluster, how many NameNodes can run ?
a) depends on clusters b) only one
c) only 3 d) depends on data nodes

* 70) Apache Hadoop YARN is a sub-project of
a) Hadoop 1.0 b) Hadoop 2.0 c) Both d) None

71) A container used to hold application data in Cassandra is called
a) Document b) Table c) Keyspaces d) Record

SECTION-II
1. Pig in Hadoop Eco system is

a) Data Flow language b) NoSQL database c) import export tool d) scheduling engine

Ans: A

2. Which of the following function is used to read data in PIG ?


A. WRITE

B. READ

C. LOAD -ans

D. None of the mentioned

3. You can run Pig in interactive mode using the ______ shell.

A. Grunt -ans

B. FS

C. HDFS

D. None of the mentioned

4. ________ is the slave/worker node and holds the user data in the form of Data
Blocks.

A. DataNode -ans

B. NameNode

C. Data block

D. Replication

5. What was Hadoop named after?


a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop’s development

6. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node b) NameNode c) Data block d) Replication

7. What is the default HDFS replication factor?


a) 4 b) 1 c) 3 d) 2

8. Which one is not one of the big data feature?


a) Velocity b) Veracity c) Volume d) Variety

9. The 3 types of Collections used in Cassandra are


a) Array, List, Map b) List, Map, Struct c) List, Set, Map d) Set, Map, Array
10. The metastore in Hive consist of ______ and ______
a) driver, services b) metaservices, database
c) driver, database d) metaservices, driver

11. Pig is __________________


a) Data Flow language b) NoSQL database c) import export tool d) scheduling
engine

12. Which of the following is a valid flow in Hadoop ?

a. Input -> Reducer -> Mapper -> Combiner -> Output
b. Input -> Mapper -> Reducer -> Combiner -> Output
c. Input -> Mapper -> Combiner -> Reducer -> Output -ans
d. Input -> Reducer -> Combiner -> Mapper -> Output
13. Hive is used as ______________
a) Data Flow language b) Data Warehousing language
c) Workflow language d) scheduling language
14. MapReduce was devised by ...

a) Apple b) Google -ans c) Microsoft d) Samsung


15. The Hadoop framework is written in
a) C++ b) Python c) Java d) C

16. Apache Cassandra was born at _________


a) Google b) Facebook c) IBM d) Yahoo

17. A container used to hold application data in Cassandra is called _______


a) Document b) Table c) Keyspaces -ans d) Record

18. Which one of the following is equivalent to the following in MongoDB:
Select * from employees order by salary desc;
a) db.employee.find().sort({“salary” :1}); b) db.employee.sort({“salary” :-1});
c) db.employee.find().sort({“salary” :-1}); d) db.employee.sort({“salary” :1});
19) The term ‘ETL’ used in Pig stands for
a) extract, transform, load b) extend, transfer, load
c) extract, transform, local d) extract, transfer, load

20) Which of the following is a component of Hadoop?

a) YARN
b) HDFS
c) Map Reduce
d) All of above - ans

21) Which of the following platforms does Apache Hadoop run on ?


a) Bare metal
b) Unix like
c) Cross platform -ans
d) Debian

22) Which type of data can Hadoop deal with ?


a) Structured
b) Semi - structured
c) Unstructured
d) All of above - ans

23) Which of the following is a column-oriented database that runs on top of HDFS
a) Hive
b) Sqoop
c) HBase
d) Flume

24) Which of the following is not a daemon process that runs on a
Hadoop cluster ?

a. JobTracker
b. DataNode
c. TaskTracker
d. TaskNode -ans

25) MongoDB stores all documents in _____________

a) tables
b) collections -ans
c) rows
d) all of the mentioned

26) Which of the following queries selects documents in the records collection that
match the condition { “user_id”: { $lt: 42 } }?

a) db.records.findOne( { “user_id”: { $lt: 42 } }, { “history”: 0 } )


b) db.records.find( { “user_id”: { $lt: 42 } }, { “history”: 0 } ) -ans
c) db.records.findOne( { “user_id”: { $lt: 42 } }, { “history”: 1 } )
d) db.records.select( { “user_id”: { $lt: 42 } }, { “history”: 0 } )
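
For reference, a hedged sketch of what the correct form does (the records collection, user_id condition and history projection come from the question):

db.records.find(
    { user_id: { $lt: 42 } },   // selection: documents where user_id < 42
    { history: 0 }              // projection: return all fields except history
)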

27) Which of the following key is used to denote uniqueness in the collection of
MongoDB?
a) _id -ans
b) id
c) id_
d) none of the mentioned

28) Which of the following lines skips the first 5 documents in the bios collection and
returns all remaining documents in MongoDB?

a) db.bios.find().limit( 5 )
b) db.bios.find().skip( 1 )
c) db.bios.find().skip( 5 )
d) db.bios.find().sort( 5 )
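
A short hedged sketch of cursor chaining in the mongo shell (the bios collection comes from the question; combining skip with limit is illustrative):

db.bios.find().skip(5)             // skip the first 5 documents, return the rest
db.bios.find().skip(5).limit(10)   // typical paging: skip 5, then return the next 10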

29) Each database created in hive is stored as


a) a directory -ans
b) a file
c) a hdfs block
d) a jar file
Brahmdevdada Mane Institute of Technology, Solapur
Department of Computer Science & Engineering
Multiple Choice Questions
Class:- BE-CSE ACY:-2020-21
Subject:- Big Data Analytics Sem-II

Unit 1 Introduction to Types of Digital Data


1. XML is which type of data ?

a) structured data b) unstructured data c) semi-structured data d) all three

Ans: C

2. E-mail is which type of data ?

a) structured data b) unstructured data c) semi-structured data d) all three

Ans: B

3. CCTV footage is which type of data ?

a) structured data b) unstructured data c) semi-structured data d) all three

Ans: B

4. XML and JSON are sources of which type of data ?

a) structured data b) unstructured data c) semi-structured data d) all three

Ans : C

5. OLTP system and spreadsheets are sources of which type of data ?

a) structured data b) unstructured data c) semi-structured data d) all three

Ans : A
6. Examples of types of data:

Structured: MS Access database relations/tables, MS Excel
Unstructured: Email, images, chat conversations, Facebook, videos
Semi-structured: XML

7. Match the following (answers shown):

NLP - Comprehend human or natural language input
Text Analytics - Text Mining
UIMA - Content Analytics
Noisy unstructured data - Text messages
Data Mining - Uses methods at the intersection of statistics, AI, machine learning & DBs
Noisy unstructured data - Chats
IBM - UIMA
Unit 2 Introduction to Big Data
1. Doug Laney, a Gartner analyst, coined the term ‘Big Data’.

2. Volatility is the characteristic of data dealing with its retention.

3. A Data Lake is a large repository of data held in its native format until it is needed.

4. The Variability characteristic of data explains the spikes in data.

5. Near real time or real time processing deals with the Velocity characteristic of data.

6. Big data is high volume, high velocity and high variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.

7. Match the Following with answers

Column A Answer
PostgreSQL Open source relational database
Scientific Data Machine-generated unstructured data
Point of Sale Machine-generated structured data
Social Media Data Human-generated unstructured data
Gaming Related Data Human-generated structured data
Mobile Data Human-generated unstructured data
Unit 3 Big Data Analytics
1) The expansion for CAP is ____________,______________ and _____________ .

a) Consistency, Ability, Partition Tolerance b) Consistency, Atomicity, Partition Tolerance

c) Consistency, Availability, Parallel d) Consistency, Availability, Partition Tolerance

Ans: D

2) MongoDB is ___________ and___________.

a) Consistency and Partition Tolerance(CP) b) Consistency and Availability (CA)

c) Availability and Partition Tolerance (AP) d) none of above

Ans: A

3) Cassandra is ___________and ___________.

a) Consistency and Partition Tolerance(CP) b) Consistency and Availability (CA)

c) Availability and Partition Tolerance (AP) d) none of above

Ans: C

4) __________ has no support for ACID properties of transaction.

a) NoSQL b) newSQL c) SQL d) none of above

Ans: A

5) __________ is a robust database that supports ACID properties of transactions and has the
scalability of NoSQL.

a) NoSQL b) newSQL c) SQL d) none of above

Ans: B

6) The expansion of BASE is ______________

Ans:- Basically Available, Soft state, Eventual consistency

7) Hadoop has a shared nothing architecture.

8) In Hadoop 2.0, a new and separate resource management framework called Yet Another
Resource Negotiator (YARN) has been added.

9) The CAP theorem is also called the Brewer theorem.

10) In-memory analytics technology helps query data that resides in a computer's random
access memory (RAM) rather than data stored on physical disks.

11) Eventual consistency is a consistency model used in distributed computing to achieve
high availability.

12) Scalability is an important advantage of the shared nothing architecture.

13) In a shared disk architecture, multiple processors have their own private memory.

14) In a shared memory architecture, central memory is shared by multiple processors.

15) Ambari is a web-based tool for provisioning, managing and monitoring Apache Hadoop
clusters.
Unit 4 Introduction to Hadoop
1. The 3V’s term of Big Data was first introduced by _________

a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

Ans: A

2. Hadoop was first introduced by __________

a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

Ans : D

3. NoSQL was first introduced by ____________

a) Doug Laney b) Brewer c) Carlo Strozzi d) Doug Cutting

Ans : C

4. NameNode in HDFS uses _________ to store the file system name space.

a) EditLog b) FsImage c) Data node d) Map reduce

Ans: B

5. NameNode in HDFS uses _________ to record every transaction.

a) EditLog b) FsImage c) Data node d) Map reduce

Ans: A

6. A typical block size used by HDFS is _______

a) 32 MB b) 64 MB c) 64 KB d) 32 KB

Ans: B

7. How many blocks will be created for a file that is 300 MB? The default block size is 64
MB and the replication factor is 3.

a) 30 b)15 c)5 d) 100

Ans: B

8. What does the Job Tracker do ?

a) stores blocks of data b) stores metadata c) coordinates and schedules the job d) acts
as a mini reducer
Ans: C

9. The MapReduce programming model widely used in analytics was developed at ______

a) Yahoo b) IBM c) Google d) Facebook

Ans: C

10. On a single Hadoop cluster, how many NameNodes can run ?

a) depends on clusters b) only one c) only 3 d) depends on data nodes

Ans: A

11. HDFS is based on _______

a) Facebook file system b) Google file system c) IBM file system d) Yahoo file system

Ans.: B

12. Apache Hadoop YARN is a sub-project of ______

a) Hadoop 1.0 b) Hadoop 2.0 c) Both d) None

Ans: B

13. YARN stands for ______________

a) Yet Another Relocator Node b) Yet Apache Resource Negotiator

c) Yet Another Resource Negotiator d) Yahoo Another Resource Negotiator

Ans: C

14. Hadoop supports structured, semi-structured and unstructured data formats.

15. In Hadoop, data is processed in parallel.

16. NameNode uses FsImage to store file system namespace.

17. NameNode uses EditLog to record every transaction

18. The Secondary NameNode is a helper or housekeeping daemon.

19. The DataNode is responsible for read/write file operations.

20. Hadoop 2.x is based on YARN architecture.

21. YARN is responsible for Cluster Management.

22. HDFS has Master/Slave architecture.

23. HDFS is built using Java Language.


24. The NameNode periodically receives a Heartbeat and a block report from each of the
data nodes in the cluster.

25. Receipt of a heartbeat implies that the DataNode is functioning properly.

26. A block report contains a list of all blocks on a data node.

27. Match me (answers are given here in the table)

HDFS Storage

Mapreduce programming Processing Data

Master Node Name Node

Slave Node Data Node

Hadoop Implementation Google File System and Map Reduce

28. Match me (answers are given here in the table)

Name Node Handles storage on master

Job Tracker Handles processing on master

Data Node Handles storage on slave

Task Tracker Handles processing on slave

29. Oozie is used to import/export data from an RDBMS. Ans:- False

30. “hadoop fs -ls /” will show the contents of the HDFS root directory. Ans:- True
BIG DATA ANALYTICS SKNSCOEK

UNIT - I

1. As companies move past the experimental phase with Hadoop, many cite the need
for additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support

2. Point out the correct statement :


a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In the Hadoop programming framework output files are divided into lines or records
d) None of the mentioned

3. According to analysts, for what can traditional IT systems provide a foundation when
they’re integrated with big data technologies like Hadoop ?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data

4. Hadoop is a framework that works with a variety of related tools. Common cohorts
include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet

5. Point out the wrong statement :


a) Hadoop’s processing capabilities are huge and its real advantage lies in the ability to
process terabytes & petabytes of data
b) Hadoop uses a programming model called “MapReduce”; all programs should
conform to this model in order to work on the Hadoop platform

c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned

6. What was Hadoop named after?


a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop’s development

7. All of the following accurately describe Hadoop, EXCEPT:


a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach

8. __________ can best be described as a programming model used to develop Hadoop-


based applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned

9. __________ has the world’s largest Hadoop cluster.


a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned

10. Facebook Tackles Big Data With _______ based on Hadoop.


a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’

UNIT – II

1. ________ is a platform for constructing data flows for extract, transform, and load
(ETL) processing and analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive

2. Point out the correct statement :


a) Hive is not a relational database, but a query engine that supports the parts of SQL
specific to querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned

3. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
a) Scalding
b) HCatalog
c) Cascalog
d) All of the mentioned

4. Hive also supports custom extensions written in :


a) C#
b) Java
c) C
d) C++

5. Point out the wrong statement :


a) Elastic MapReduce (EMR) is Facebook’s packaged Hadoop offering
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon’s packaged Hadoop
offering

c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned

6. ________ is the most popular high-level Java API in Hadoop Ecosystem


a) Scalding
b) HCatalog
c) Cascalog
d) Cascading

7. ___________ is a general-purpose computing model and runtime system for
distributed data analytics.
a) Mapreduce
b) Drill
c) Oozie
d) None of the mentioned

8. The Pig Latin scripting language is not only a higher-level data flow language but
also has operators similar to :
a) SQL
b) JSON
c) XML
d) All of the mentioned

9. _______ jobs are optimized for scalability but not latency.


a) Mapreduce
b) Drill
c) Oozie
d) Hive

10. ______ is a framework for performing remote procedure calls and data serialization.
a) Drill
b) BigTop
c) Avro
d) Chukwa

UNIT - III

1. IBM and ________ have announced a major initiative to use Hadoop to support
university courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google

2. Point out the correct statement :


a) Hadoop is an ideal environment for extracting and transforming small volumes of
data
b) Hadoop stores data in HDFS and supports data compression/decompression
c) The Giraph framework is less useful than a MapReduce job for solving graph and
machine learning problems
d) None of the mentioned

3. What license is Hadoop distributed under ?


a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial

4. Sun also has the Hadoop Live CD ________ project, which allows running a fully
functional Hadoop cluster using a live CD.
a) OpenOffice.org
b) OpenSolaris
c) GNU
d) Linux

5. Which of the following genres does Hadoop produce ?


a) Distributed file system
b) JAX-RS
c) Java Message Service

d) Relational Database Management System

6. What was Hadoop written in ?


a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)

7. Which of the following platforms does Hadoop run on ?


a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like

8. Hadoop achieves reliability by replicating the data across multiple hosts, and hence
does not require ________ storage on hosts.
a) RAID
b) Standard RAID levels
c) ZFS
d) Operating system

9. Above the file systems comes the ________ engine, which consists of one Job Tracker,
to which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook

10. The Hadoop list includes the HBase database, the Apache Mahout ________ system,
and matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence

UNIT – IV

1. A ________ node acts as the Slave and is responsible for executing a Task assigned to
it by the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker

2. Point out the correct statement :


a) MapReduce tries to place the data and the compute as close as possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned

3. ___________ part of the MapReduce is responsible for processing one or more chunks
of data and producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned

4. _________ function is responsible for consolidating the results produced by each of


the Map() functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned

5. Point out the wrong statement :


a) A MapReduce job usually splits the input data-set into independent chunks which
are processed by the map tasks in a completely parallel manner
b) The MapReduce framework operates exclusively on <key, value> pairs
c) Applications typically implement the Mapper and Reducer interfaces to provide the
map and reduce methods

d) None of the mentioned

6. Although the Hadoop framework is implemented in Java, MapReduce applications


need not be written in :
a) Java
b) C
c) C#
d) None of the mentioned

7. ________ is a utility which allows users to create and run jobs with any executables
as the mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned

8. __________ maps input key/value pairs to a set of intermediate key/value pairs.


a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned

9. The number of maps is usually driven by the total size of :


a) inputs
b) outputs
c) tasks
d) None of the mentioned

10. _________ is the default Partitioner for partitioning key space.


a) HashPar
b) Partitioner
c) HashPartitioner
d) None of the mentioned

UNIT – V

1. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned

2. Point out the correct statement :


a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit
generated by the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-
len, value) format
d) All of the mentioned

3. Input to the _______ is the sorted output of the mappers.


a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned

4. The right number of reduces seems to be :


a) 0.90
b) 0.80
c) 0.36
d) 0.95

5. Point out the wrong statement :


a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases
load balancing and lowers the cost of failures

c) It is legal to set the number of reduce-tasks to zero if no reduction is desired


d) The framework groups Reducer inputs by keys (since different mappers may have
output the same key) in sort stage

6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned

7. Which of the following phases occur simultaneously ?


a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned

8. Mapper and Reducer implementations can use the ________ to report progress or just
indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned

9. __________ is a generalization of the facility provided by the MapReduce framework to


collect data output by the Mapper or the Reducer
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned

10. _________ is the primary interface for a user to describe a MapReduce job to the
Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf

d) None of the mentioned



UNIT - VI

1. Which of the following scripts generates more than three MapReduce jobs ?
a)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
c = for b generate group.$1, group.$2, COUNT(a);
d = filter c by $2 > 3;
dump d;
b)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = display a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
c = foreach b generate group.$1, group.$2, COUNT(a);
d = filter c by $2 > 3;
dump d;
c)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
c = foreach b generate group.$1, group.$2, COUNT(a);
d = filter c by $2 > 3;
dump d;
d) None of the mentioned

2. Point out the correct statement :


a) LoadPredicatePushdown is same as LoadMetadata.setPartitionFilter
b) getOutputFormat() is called by Pig to get the InputFormat used by the loader
c) Pig works with data from many sources
d) None of the mentioned

3. Which of the following finds the running time of each script (in seconds) ?
a)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
c = group b by (id, user, script_name);
d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
dump d;
b)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = for a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
c = group b by (id, user, script_name);
d = for c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
dump d;
c)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
c = group b by (id, user, queue) parallel 10;
d = foreach c generate group.user, group.queue, COUNT(b);
dump d;
d) All of the mentioned

4. Which of the following scripts determines the number of scripts run by user and
queue on a cluster ?
a)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
c = filter b by status != 'SUCCESS';
dump c;
b)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;
c)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
c = group b by (id, user, queue) parallel 10;
d = foreach c generate group.user, group.queue, COUNT(b);
dump d;
d) None of the mentioned

5. Point out the wrong statement :


a) Pig can invoke code in a language like Java only.
b) Pig enables data workers to write complex data transformations without knowing
Java.
c) Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers
already familiar with scripting languages and SQL.
d) Pig is complete, so you can do all required data manipulations in Apache Hadoop
with Pig.

6. Which of the following scripts is used to check scripts that have failed jobs ?
a)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
c = filter b by status != 'SUCCESS';
dump c;
b)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;
c)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
c = group b by (id, user, queue) parallel 10;
d = foreach c generate group.user, group.queue, COUNT(b);
dump d;
d) None of the mentioned

7. Which of the following code is used to find scripts that use only the default
parallelism ?
a)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
c = filter b by status != 'SUCCESS';
dump c;
b)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;
c)
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
c = group b by (id, user, queue) parallel 10;
d = foreach c generate group.user, group.queue, COUNT(b);
dump d;
d) None of the mentioned

8. Pig Latin is _______ and fits very naturally in the pipeline paradigm while SQL is
instead declarative.
a) functional
b) procedural
c) declarative

d) All of the mentioned

9. In comparison to SQL, Pig uses :


a) lazy evaluation
b) ETL
c) Supports pipeline splits
d) All of the mentioned

10. Which of the following is an entry in jobconf ?


a) pig.job
b) pig.input.dirs
c) pig.feature
d) None of the mentioned

UNIT – VII

1. Which of the following is shortcut for DUMP operator ?


a) \de alias
b) \d alias
c) \q
d) None of the mentioned

2. Point out the correct statement:


a) Invoke the Grunt shell using the “enter” command
b) Pig does not support jar files
c) Both the run and exec commands are useful for debugging because you can modify a
Pig script in an editor
d) All of the mentioned

3. Which of the following command is used to show values to keys used in Pig ?
a) set
b) declare
c) display
d) All of the mentioned

4. Use the __________ command to run a Pig script that can interact with the Grunt
shell (interactive mode).
a) fetch
b) declare
c) run
d) All of the mentioned

5. Point out the wrong statement:


a) You can run Pig scripts from the command line and from the Grunt shell
b) DECLARE defines a Pig macro
c) Use Pig scripts to place Pig Latin statements and Pig commands in a single file
d) None of the mentioned

6. Which of the following command can be used for debugging ?


a) exec
b) execute
c) error
d) throw

7. Which of the following file contains user defined functions (UDFs) ?


a) script2-local.pig
b) pig.jar
c) tutorial.jar
d) excite.log.bz2

8. Which of the following is correct syntax for parameter substitution using cmd ?
a) pig {-param param_name = param_value | -param_file file_name} [-debug | -dryrun]
script
b) {%declare | %default} param_name param_value
c) {%declare | %default} param_name param_value cmd
d) All of the mentioned

9. You can specify parameter names and parameter values in one of the ways:
a) As part of a command line.
b) In parameter file, as part of a command line
c) With the declare statement, as part of Pig script
d) All of the mentioned

10. _________ are scanned in the order they are specified on the command line.
a) Command line parameters
b) Parameter files
c) Declare and default preprocessors
d) Both parameter files and command line parameters

UNIT – VIII

1. _________ operator is used to review the schema of a relation.


a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN

2. Point out the correct statement :


a) During the testing phase of your implementation, you can use LOAD to display
results to your terminal screen
b) You can view outer relations as well as relations defined in a nested FOREACH
statement
c) Hadoop properties are interpreted by Pig
d) None of the mentioned

3. Which of the following operator is used to view the map reduce execution plans ?
a) DUMP
b) DESCRIBE
c) STORE
d) EXPLAIN

4. ___________ operator is used to view the step-by-step execution of a series of


statements.
a) ILLUSTRATE
b) DESCRIBE
c) STORE
d) EXPLAIN

5. Point out the wrong statement :


a) ILLUSTRATE operator is used to review how data is transformed through a sequence
of Pig Latin statements
b) ILLUSTRATE is based on an example generator
c) Several new private classes make it harder for external tools such as Oozie to
integrate with Pig statistics


d) None of the mentioned

6. __________ is a framework for collecting and storing script-level statistics for Pig
Latin.
a) Pig Stats
b) PStatistics
c) Pig Statistics
d) None of the mentioned

7. The ________ class mimics the behavior of the Main class but gives users a statistics
object back.
a) PigRun
b) PigRunner
c) RunnerPig
d) None of the mentioned

8. ___________ is a simple xUnit framework that enables you to easily test your Pig
scripts.
a) PigUnit
b) PigXUnit
c) PigUnitX
d) All of the mentioned

9. Which of the following will compile the Pigunit ?


a) $pig_trunk ant pigunit-jar
b) $pig_tr ant pigunit-jar
c) $pig_ ant pigunit-jar
d) None of the mentioned

10. PigUnit runs in Pig’s _______ mode by default.


a) local
b) tez
c) mapreduce
d) None of the mentioned
