The document provides definitions and explanations of key concepts in Hadoop and Big Data including:
1. Big data is large, growing collections of data that can be analyzed for insights.
2. Hadoop uses various data types like IntWritable, FloatWritable, and Text for storing different types of data as key-value pairs.
3. Applications of big data include tracking customer spending habits, recommendations, media and entertainment.
The document then goes on to explain additional Hadoop concepts such as Sqoop, MapReduce, HDFS, and common commands.
1. What is Big Data?
A) Big Data is a collection of large volumes of data that grows exponentially with time.
2. List the data types in Hadoop.
A) Hadoop wraps Java types in Writable classes so that they can be passed as keys or values:
a. Integer -> IntWritable: passes integer numbers as a key or value.
b. Float -> FloatWritable: passes floating point numbers as a key or value.
c. Long -> LongWritable: stores long values.
d. Short -> ShortWritable: stores short values.
e. Double -> DoubleWritable: stores double values.
f. String -> Text: passes strings as a key or value.
g. Byte -> ByteWritable: stores a single byte (BytesWritable stores a sequence of bytes).
h. null -> NullWritable: passes null as a key or value.
3. List out the applications of Big Data.
A) Applications of Big Data:
a. Tracking customer spending habits and shopping behaviour
b. Recommendation
c. Virtual personal assistant tools
d. Media and entertainment sector
4. What is Sqoop in Hadoop?
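A) Apache Sqoop is a tool for transferring bulk data between Hadoop and external structured datastores such as relational databases. It imports data into the Hadoop Distributed File System or related Hadoop ecosystem components like Hive and HBase, and it can also export data from Hadoop back to those datastores.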
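5. List out the differences between Sqoop and Hadoop.
A) Hadoop is the overall framework for distributed storage and processing of large data sets, whereas Sqoop is a tool within the Hadoop ecosystem that is mostly used to extract structured data from databases like Teradata, Oracle, etc.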
6. Where does the shuffle and sort process take place?
A) The shuffle and sort phase is performed by the framework, between the map phase and the reduce phase. Data from all mappers is grouped by key, split among the reducers, and sorted by key. Each reducer obtains all values associated with the same key.
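The grouping, partitioning, and sorting described above can be sketched in plain Python. This is only a toy model, not Hadoop code: the function name shuffle_and_sort and the byte-sum partitioner are assumptions made for illustration (Hadoop's default partitioner is hash-based).

```python
from collections import defaultdict


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Group (key, value) pairs by key, split keys among reducers, sort by key.

    mapper_outputs: one list of (key, value) pairs per mapper.
    Returns one sorted list of (key, [values]) per reducer.
    """
    # One "key -> list of values" bucket per reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            # Toy partitioner (an assumption): a deterministic stand-in
            # for hashing the key and taking it modulo the reducer count.
            index = sum(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    # Each reducer receives its keys in sorted order.
    return [sorted(bucket.items()) for bucket in partitions]
```

Because the partition depends only on the key, every value for a given key ends up at the same reducer, which is the property the answer above relies on.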
7. Indicate the differences between name node and data node.
A) The main difference between Name Node and Data Node in Hadoop is that the Name Node is the master node in HDFS that manages the file system metadata while the Data Node is a slave node in HDFS that stores the actual data as instructed by the Name Node.
8. State the principle of the JobTracker.
A) The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. Client applications submit jobs to the JobTracker. The JobTracker then submits the work to the chosen TaskTracker nodes.
9. Name the features of the TaskTracker.
A) A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept.
10. Write map and reduce functions.
A) The Map function takes input from the disk as <key,value> pairs, processes them, and produces another set of intermediate <key,value> pairs as output. The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as output.
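As an illustration, a word count can be written as a pair of map and reduce functions. The sketch below is plain Python rather than the Hadoop API; the names map_function, reduce_function, and run_job are hypothetical, and run_job stands in for the work the framework does between the two phases.

```python
def map_function(key, value):
    """key: line offset (unused); value: one line of text.
    Emits an intermediate (word, 1) pair for every word."""
    for word in value.split():
        yield (word.lower(), 1)


def reduce_function(key, values):
    """key: a word; values: every count emitted for that word.
    Emits a single (word, total) pair."""
    yield (key, sum(values))


def run_job(lines):
    """Tiny in-memory driver standing in for the framework."""
    # Map phase: call the map function once per input record.
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_function(offset, line))
    # Group values by key (the framework's shuffle and sort).
    grouped = {}
    for word, count in intermediate:
        grouped.setdefault(word, []).append(count)
    # Reduce phase: call the reduce function once per key, in key order.
    output = []
    for word in sorted(grouped):
        output.extend(reduce_function(word, grouped[word]))
    return output
```

For example, run_job(["big data", "Big Hadoop"]) returns [("big", 2), ("data", 1), ("hadoop", 1)].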
11. Label the usage of combine and split functions.
A) Combiner function: A combiner, also known as a semi-reducer, is an optional class that accepts the output of the Map class and passes its output key-value pairs on to the Reducer class. Its main function is to summarize the map output records that share the same key, which reduces the amount of data sent to the reducers.
Split function: The input split is user defined, and the user can control the split size in the MapReduce program. One split can span multiple blocks, and one block can be divided into multiple splits. The number of map tasks (mappers) is equal to the number of input splits.
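Both ideas can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Hadoop code: combine and number_of_map_tasks are hypothetical names, and the split count ignores details of how a real input format computes splits.

```python
import math
from collections import Counter


def combine(map_output):
    """Summarize one mapper's output by summing the values of equal keys."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


def number_of_map_tasks(file_size_bytes, split_size_bytes):
    """One map task per input split; the last split may be smaller."""
    return math.ceil(file_size_bytes / split_size_bytes)
```

Here combine([("a", 1), ("b", 1), ("a", 1)]) returns [("a", 2), ("b", 1)], so two records instead of three leave the mapper. With a 128 MB split size, a 300 MB file yields 3 map tasks in this simplified model.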
12. Explain the Four V’s of Big Data.
A) Volume: Volume represents the amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
Variety: Variety refers to the heterogeneity of data types. In other words, the data gathered comes in a variety of formats such as video, audio, CSV, etc.
Velocity: Velocity refers to the rate at which data is growing, which is very fast. Yesterday’s data is already considered old today. Social media is a major contributor to the velocity of growing data.
Veracity: Veracity refers to doubt or uncertainty about the available data, caused by inconsistency and incompleteness.
13. What is Map Reduce?
A) MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
14. Write down the features of HDFS.
A) Features of HDFS: a. Highly scalable b. Portable c. Distributed data storage d. Replication
15. What is Flume?
A) Apache Flume is a reliable and distributed system for collecting, aggregating and moving massive quantities of log data. It has a simple yet flexible architecture based on streaming data flows.
16. List Hadoop operation modes.
A) Hadoop mainly works in three different modes: a. Standalone mode b. Pseudo-distributed mode c. Fully distributed mode
17. What is the fair scheduler?
A) Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. The scheduler further organizes apps into "queues" and shares resources fairly between these queues.
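The idea of sharing fairly between queues, and then between the apps inside each queue, can be illustrated with a small Python sketch. It is a deliberate simplification (equal weights, no minimum shares, no preemption), and fair_shares is a hypothetical name, not part of Hadoop.

```python
def fair_shares(total_slots, queues):
    """queues maps a queue name to the list of apps running in it.

    Returns each app's share of total_slots, assuming every non-empty
    queue gets an equal share and every app in a queue gets an equal
    part of that queue's share.
    """
    # Queues with no running apps do not receive a share.
    active = {name: apps for name, apps in queues.items() if apps}
    shares = {}
    for apps in active.values():
        per_queue = total_slots / len(active)
        for app in apps:
            shares[app] = per_queue / len(apps)
    return shares
```

With 12 slots and queues {"prod": ["a"], "dev": ["b", "c"]}, app a receives 6 slots and apps b and c receive 3 each.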
18. Explain about Blocks in HDFS.
A) Blocks are the smallest contiguous locations on a hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. The default size of a block in HDFS is 128 MB.
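The arithmetic of splitting a file into blocks can be sketched as follows. The helper split_into_blocks is a hypothetical name used only for illustration; it takes the 128 MB default quoted above as its block size.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # the 128 MB default mentioned above, in bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file would be stored as."""
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        # The final block only holds the data that is left over.
        blocks.append(remainder)
    return blocks
```

A 300 MB file is stored as three blocks of 128 MB, 128 MB, and 44 MB.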
19. Which command is used to list the files in Hadoop?
A) The hdfs dfs -ls command lists files in Hadoop. Run it by specifying the path of the directory to list. The same command also lists the files in a Hadoop archive when given the archive directory location.
20. Write about HAR files.
A) A Hadoop Archive (HAR) is created from a collection of files by the archiving tool, which runs a MapReduce job. This MapReduce job processes the input files in parallel to create the archive file.
21. Compare and contrast NoSQL vs. Relational Databases.
A) SQL databases are known as relational databases. They have a table-based data structure and require a strict, predefined schema. NoSQL databases, or non-relational databases, can be document based, graph databases, key-value pairs, or wide-column stores. NoSQL databases don't require any predefined schema, allowing you to work more freely with "unstructured data." Relational databases are vertically scalable, which is usually more expensive, whereas the horizontal scaling nature of NoSQL databases is more cost-efficient.
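22. What is serialization? Why is it needed?
A) Serialization is the process of transforming structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into structured objects. In Hadoop it is needed for interprocess communication between nodes and for persistent storage.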
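A minimal Python sketch of the round trip for a single integer is shown below. It uses the standard struct module with a four-byte big-endian layout, which is chosen here to resemble how a Java int is written to a data stream; serialize_int and deserialize_int are hypothetical helper names, not Hadoop functions.

```python
import struct


def serialize_int(value):
    """Turn a 32-bit signed integer into a 4-byte big-endian byte string."""
    return struct.pack(">i", value)


def deserialize_int(data):
    """Rebuild the integer from its 4-byte big-endian representation."""
    return struct.unpack(">i", data)[0]
```

For example, serialize_int(163) produces the bytes 00 00 00 a3, and deserialize_int turns those bytes back into 163.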
23. What is the mechanism used by HDFS to ensure data integrity?
A) Data integrity in Hadoop is achieved by maintaining checksums of the data written to each block. Whenever data is written to HDFS blocks, HDFS calculates a checksum for all data written and verifies that checksum when the data is read back. A separate checksum is created for each fixed-size chunk of the data.
24. Briefly explain MapReduce data flow.
A) MapReduce is used to process huge amounts of data. To handle the incoming data in a parallel and distributed manner, the data flows through several phases: input splits, map, shuffle and sort, and reduce.
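The checksum mechanism from question 23 can be illustrated with a small Python sketch. It is not HDFS code: the 512-byte chunk size and the use of zlib.crc32 are assumptions chosen for the example, and compute_checksums and verify are hypothetical names.

```python
import zlib

CHUNK_SIZE = 512  # bytes per checksum; an assumed value for this sketch


def compute_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC-32 checksum per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def verify(data, stored_checksums, chunk_size=CHUNK_SIZE):
    """Recompute the checksums on read and compare them with the stored ones."""
    return compute_checksums(data, chunk_size) == stored_checksums
```

Checksums are computed when the data is written and stored alongside it. On read, verify returns False if any chunk no longer matches its stored checksum, which signals corruption.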
25. Why is a block in HDFS so large?
A) HDFS blocks are much larger than disk blocks, and the reason is to minimize the cost of seeks. If the block is made large enough, the time to transfer the data from the disk becomes significantly longer than the time to seek to the start of the block.
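The trade-off can be worked through with a short calculation. The figures in the note below (a 10 ms seek time, a 100 MB/s transfer rate, and a 1% target) are illustrative assumptions, not measurements, and block_size_for_seek_fraction is a hypothetical helper name.

```python
def block_size_for_seek_fraction(seek_time_ms, transfer_rate_mb_per_s, seek_fraction):
    """Block size in MB at which the seek time equals the given fraction
    of the time spent transferring the block."""
    # Transfer time needed so that seek_time = seek_fraction * transfer_time.
    transfer_time_s = (seek_time_ms / 1000) / seek_fraction
    return transfer_rate_mb_per_s * transfer_time_s
```

With the assumed figures, block_size_for_seek_fraction(10, 100, 0.01) gives about 100 MB, which is in the same range as the 128 MB default.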
26. Explain about the Key Value pairs in a Hadoop MapReduce.
A) A key-value pair is the record entity that Hadoop MapReduce accepts for execution. Hadoop is used mainly for data analysis, and it deals with structured, unstructured, and semi-structured data.
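27. What are the two nodes operating in an HDFS cluster?
A) The two types of node operating in an HDFS cluster are the Name Node (the master) and the Data Nodes (the workers). The Secondary Name Node is a helper for the Name Node rather than one of these two node types.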
28. What is structured, semi-structured, and unstructured data?
A) Data can be classified as structured, semi-structured, or unstructured. Structured data resides in predefined formats and models. Unstructured data is stored in its natural format until it is extracted for analysis. Semi-structured data is a mix of both structured and unstructured data.
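29. What are the various file permissions in HDFS?
A) Each file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.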
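The owner, group, and other permission classes can be decoded from an octal mode with a short Python sketch. It uses the standard stat module and local POSIX-style bit masks rather than any HDFS API, and describe_permissions is a hypothetical helper name. The x (execute) bit is included because it is part of the same three-letter notation.

```python
import stat

# Permission bits for each class of user, in (read, write, execute) order.
CLASSES = {
    "owner": (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
    "group": (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
    "other": (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
}


def describe_permissions(mode):
    """Map an octal mode such as 0o644 to an rwx string per user class."""
    result = {}
    for name, bits in CLASSES.items():
        result[name] = "".join(
            letter if mode & bit else "-"
            for letter, bit in zip("rwx", bits)
        )
    return result
```

For example, describe_permissions(0o644) returns {"owner": "rw-", "group": "r--", "other": "r--"}: the owner may read and write, everyone else may only read.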
30. What is Hadoop streaming?
A) Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
31. Define a block. Why is the HDFS block size larger than a disk block?
A) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block.
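Returning to Hadoop streaming (question 30): the mapper and reducer are ordinary programs that read lines from standard input and write lines to standard output, with the key and value separated by a tab by default. The Python sketch below shows the logic of a word-count mapper and reducer written as generator functions over lines. The names mapper and reducer are hypothetical, and a real streaming script would loop over sys.stdin and print each record.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; expects input lines sorted by key."""
    current_word, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            # A new key starts: flush the total of the previous one.
            if current_word is not None:
                yield f"{current_word}\t{total}"
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        yield f"{current_word}\t{total}"
```

The reducer relies on its input arriving sorted by key, which is what the framework's sort provides between the two steps. Locally, sorted(mapper(lines)) can stand in for that step.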
32. Define HDFS.
A) The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. ... HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
33. What is GenericOptionsParser Class?
A) The GenericOptionsParser class is a utility class within the org.apache.hadoop.util package. This class parses the standard command line arguments and sets them on a configuration object, which can then be used within the application.
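The idea of pulling standard options out of the command line and setting them on a configuration object can be mimicked in Python. The sketch below is not the Hadoop class and handles only the -D property=value form; parse_generic_options is a hypothetical name, and a plain dictionary stands in for the configuration object.

```python
def parse_generic_options(args):
    """Separate '-D key=value' options from the remaining arguments.

    Returns (config, remaining): config is a dict of the parsed properties,
    remaining is the list of application-specific arguments.
    """
    config, remaining = {}, []
    i = 0
    while i < len(args):
        if args[i] == "-D" and i + 1 < len(args) and "=" in args[i + 1]:
            key, value = args[i + 1].split("=", 1)
            config[key] = value
            i += 2  # skip the flag and its property
        else:
            remaining.append(args[i])
            i += 1
    return config, remaining
```

Given ["-D", "mapreduce.job.reduces=2", "in", "out"], the sketch returns the property in the configuration dictionary and leaves ["in", "out"] for the application to interpret.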
34. Name the four independent entities involved in a classic MapReduce job run.
A) At the highest level, there are four independent entities:
a. The client, which submits the MapReduce job.
b. The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
c. The tasktrackers, which run the tasks that the job has been split into.
d. The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
35. How does the Capacity Scheduler approach differ from the Fair Scheduler?
A) The Fair Scheduler assigns an equal amount of resources to all running jobs. When a job completes, its free slot is assigned to a new job with an equal share of resources. Here, resources are shared between queues. The Capacity Scheduler, on the other hand, assigns resources based on the capacity required by the organisation.
36. Write the advantage of Lazy output.
A) FileOutputFormat subclasses create output files (part-r-nnnn) even if they are empty. Some applications prefer not to create empty files, which is where LazyOutputFormat helps. LazyOutputFormat is a wrapper OutputFormat that ensures an output file is created only when the first record is emitted for a given partition. To use LazyOutputFormat, call its setOutputFormatClass() method with the JobConf and the underlying output format. Streaming and Pipes support a -lazyOutput option to enable LazyOutputFormat.
37. What is the use of MultipleOutputs?
A) The MultipleOutputs class provides a facility to write Hadoop map/reduce output to more than one folder. We can use MultipleOutputs when we want to write outputs in addition to the job's default output, or to write the job's output to different files specified by the user.