Big data is large datasets that cannot be processed using traditional techniques. It includes data from sources like social media, stock exchanges, power grids, search engines, and more. Hadoop is an open source framework for storing and processing big data across clusters of commodity servers. It provides benefits like cost savings, faster processing speeds, and insights to help businesses. Key components of Hadoop include HDFS for storage, MapReduce for processing, and tools like Hive, Pig, Flume, Sqoop, Zookeeper, Kafka, and HBase.
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it has become a complete subject involving various tools, techniques, and frameworks.
What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of big data.
Black Box Data − A component of helicopters, airplanes, jets, and so on. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data − Holds information about the "buy" and "sell" decisions that customers make on the shares of different companies.
Power Grid Data − Holds the information consumed by a particular node with respect to a base station.
Transport Data − Includes the model, capacity, distance, and availability of a vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.
Benefits of Big Data
Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media. Using social media information such as the preferences and product perception of their consumers, product companies and retail organizations plan their production. Using data on the previous medical history of patients, hospitals provide better and quicker service.
Hadoop
Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model to process data in parallel across its nodes, making storage and retrieval faster.
From a business point of view, too, there are direct and indirect benefits. By using open-source technology on inexpensive servers that are mostly in the cloud (and sometimes on-premises), organizations achieve significant cost savings. Additionally, the ability to collect massive data, and the insights derived from crunching this data, results in better real-world business decisions, such as the ability to focus on the right consumer segment, weed out or fix erroneous processes, optimize floor operations, provide relevant search results, perform predictive analytics, and so on.
How Hadoop Improves on Traditional Databases
Hadoop solves two key challenges with traditional databases:
1. Capacity: Hadoop stores large volumes of data. By using a distributed file system called HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers. Because these commodity servers are built with simple hardware configurations, they are economical and easily scalable as the data grows. (A conceptual sketch of this block placement follows this list.)
2. Speed: Hadoop stores and retrieves data faster. Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. When a query is sent to the database, instead of the data being handled sequentially, the work is split into tasks that run concurrently across distributed servers. Finally, the output of all the tasks is collated and sent back to the application, drastically improving the processing speed.
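To picture the storage side in point 1, here is a minimal conceptual Python sketch, not real HDFS code: it splits a file into fixed-size blocks and assigns each block to several datanodes. The 128 MB block size and replication factor of 3 are HDFS defaults, while the node names are made up for illustration.

    import itertools

    BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
    REPLICATION = 3                 # HDFS default replication factor
    DATANODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

    def split_into_blocks(path):
        """Yield fixed-size chunks of a file, like HDFS blocks."""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(BLOCK_SIZE)
                if not chunk:
                    break
                yield chunk

    def place_blocks(path):
        """Assign each block to REPLICATION datanodes, round-robin style.
        A real NameNode uses rack-aware placement instead."""
        nodes = itertools.cycle(DATANODES)
        return {i: [next(nodes) for _ in range(REPLICATION)]
                for i, _ in enumerate(split_into_blocks(path))}

The principle is the same as in HDFS: spread replicated blocks across many cheap machines so that losing one machine loses no data.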
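The processing model in point 2 is classically illustrated by word count. With Hadoop Streaming, the mapper and reducer can be plain Python scripts that read stdin and write stdout; the sketch below relies on the framework's shuffle phase delivering the mapper output to the reducer sorted by key.

    # mapper.py: emit a (word, 1) pair for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py: sum the counts per word (input arrives sorted by key)
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

A job like this would typically be launched with the hadoop-streaming JAR, passing the two scripts via its -mapper and -reducer options; many mapper and reducer instances then run concurrently on different chunks of the input.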
The Hadoop Ecosystem: Supplementary Components
The following are a few supplementary components that are extensively used in the Hadoop ecosystem.
Hive: Data Warehousing
Hive is a data warehousing system that helps query large datasets in HDFS. Before Hive, developers faced the challenge of creating complex MapReduce jobs to query the Hadoop data. Hive uses HQL (Hive Query Language), which resembles the syntax of SQL. Since most developers come from a SQL background, Hive is easy to get on board with. The advantage of Hive is that a JDBC/ODBC driver acts as an interface between the application and HDFS. Originally developed by the Facebook team, Hive is now an open-source technology.
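As a sketch of what this driver-based access can look like in practice, here is a hypothetical query using the third-party PyHive library; the host, port, and table names are placeholders, not part of any real deployment.

    from pyhive import hive  # third-party driver: pip install pyhive

    # Connect to a HiveServer2 instance (host and port are placeholders)
    conn = hive.connect(host="hive-server.example.com", port=10000)
    cursor = conn.cursor()

    # HQL looks almost identical to SQL
    cursor.execute("SELECT page, COUNT(*) FROM clicks GROUP BY page")
    for page, hits in cursor.fetchall():
        print(page, hits)

Behind the scenes, Hive translates a query like this into jobs that run over the data in HDFS.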
Pig: Reducing the Need for MapReduce Functions
Pig, initially developed by Yahoo!, is similar to Hive in that it eliminates the need to create MapReduce functions to query HDFS. As with HQL, the language used here, called Pig Latin, is closer to SQL. Pig Latin is a high-level data-flow language layered on top of MapReduce. Pig also has a runtime environment that interfaces with HDFS. Scripts in languages such as Java or Python can also be embedded inside Pig.
Flume: Big Data Ingestion
Flume is a big data ingestion tool that acts as a courier service between multiple data sources and HDFS. It collects, aggregates, and sends huge amounts of streaming data (e.g., log files, events) generated by applications such as social media sites, IoT apps, and ecommerce portals into HDFS. Flume is feature-rich; it:
Has a distributed architecture.
Ensures reliable data transfer.
Is fault-tolerant.
Has the flexibility to collect data in batches or in real time.
Can be scaled horizontally to handle more traffic, as needed.
Data sources communicate with Flume agents; every agent has a source, a channel, and a sink. The source collects data from the sender, the channel temporarily stores the data, and finally, the sink transfers the data to the destination, which is a Hadoop server.
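The agent pipeline is easy to picture in code. Below is a conceptual Python sketch of the source, channel, and sink pattern; it is not the Flume API, just an illustration of the flow, with a simple queue standing in for the channel and a list standing in for the Hadoop destination.

    from queue import Queue

    channel = Queue()  # the channel buffers events between source and sink

    def source(events):
        """Source: collect events from a sender, put them on the channel."""
        for event in events:
            channel.put(event)

    def sink(destination):
        """Sink: drain the channel, deliver events to the destination."""
        while not channel.empty():
            destination.append(channel.get())

    hdfs_stand_in = []                       # stand-in for a Hadoop server
    source(["login", "click", "purchase"])   # e.g., web log events
    sink(hdfs_stand_in)
    print(hdfs_stand_in)                     # ['login', 'click', 'purchase']

The buffering channel is what lets a real Flume agent absorb bursts from fast producers without losing events while the sink catches up.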
Sqoop: Data Ingestion for Relational Databases
Sqoop ("SQL to Hadoop") is another data ingestion tool like Flume. While Flume works on unstructured or semi-structured data, Sqoop is used to export data from and import data into relational databases. Because most enterprise data is stored in relational databases, Sqoop is used to import that data into Hadoop for analysts to examine. Database admins and developers can use a simple command-line interface to export and import data. Sqoop converts these commands into MapReduce format and sends them to HDFS using YARN. Like Flume, Sqoop is also fault-tolerant and performs concurrent operations.
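For illustration only, a typical import might look like: sqoop import --connect jdbc:mysql://dbhost/sales --table customers --target-dir /data/customers, where the host, database, and table names are placeholders. Sqoop turns such a command into map tasks that pull rows from the table in parallel and write them into the target directory in HDFS.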
Zookeeper: Coordination of Distributed Applications
Zookeeper is a service that coordinates distributed applications. In the Hadoop framework, it acts as an admin tool with a centralized registry that has information about the cluster of distributed servers it manages. Some of its key functions are:
Maintaining configuration information − the shared state of configuration data.
Naming service − assignment of a name to each server.
Synchronization service − handles deadlocks, race conditions, and data inconsistency.
Leader election − elects a leader among the servers through consensus.
The cluster of servers that the Zookeeper service runs on is called an "ensemble." The ensemble elects a leader among the group, with the rest behaving as followers. All write operations from clients must be routed through the leader, whereas read operations can go directly to any server. Zookeeper provides high reliability and resilience through fail-safe synchronization, atomicity, and serialization of messages.
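As a minimal sketch of the registry idea, here is a hypothetical use of the third-party kazoo client library; the ensemble address and the znode path and value are placeholders.

    from kazoo.client import KazooClient  # third-party: pip install kazoo

    # Connect to a Zookeeper ensemble (address is a placeholder)
    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Store a piece of shared configuration state in a znode
    zk.ensure_path("/app")
    zk.create("/app/config", b"max_connections=100")

    # Any server in the cluster can now read the same state
    data, stat = zk.get("/app/config")
    print(data.decode())  # max_connections=100

    zk.stop()

Because every participant reads and writes the same znodes, configuration stays consistent across the cluster instead of drifting per machine.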
Kafka: Faster Data Transfers
Kafka is a distributed publish-subscribe messaging system that is often used with Hadoop for faster data transfers. A Kafka cluster consists of a group of servers that act as an intermediary between producers and consumers. In the context of big data, an example of a producer could be a sensor gathering temperature data to relay back to the server. Consumers are the Hadoop servers. The producers publish messages on a topic and the consumers pull messages by listening to the topic. A single topic can be split further into partitions. All messages with the same key arrive at the same partition. A consumer can listen to one or more partitions. By grouping messages under one key and getting a consumer to cater to specific partitions, many consumers can listen on the same topic at the same time. Thus, a topic is parallelized, increasing the throughput of the system. Kafka is widely adopted for its speed, scalability, and robust replication.
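The publish-subscribe flow can be sketched with the third-party kafka-python client; the broker address, topic name, and sensor key below are placeholders. Note how the key is what pins messages to a partition, as described above.

    from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

    # Producer: e.g., a sensor publishing temperature readings
    producer = KafkaProducer(bootstrap_servers="broker.example.com:9092")
    producer.send("temperatures", key=b"sensor-1", value=b"21.5")
    producer.flush()  # block until the message is actually delivered

    # Consumer: e.g., a Hadoop-side process listening on the topic
    consumer = KafkaConsumer("temperatures",
                             bootstrap_servers="broker.example.com:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=10000)  # stop after 10 s idle
    for message in consumer:
        print(message.key, message.value)  # b'sensor-1' b'21.5'

Adding more consumers that each handle different partitions of the same topic is how the throughput scales.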
HBase: Non-Relational Database
HBase is a column-oriented, non-relational database that sits on top of HDFS. One of the challenges with HDFS is that it can only do batch processing. So, for simple interactive queries, data still has to be processed in batches, leading to high latency. HBase solves this challenge by allowing queries on single rows across huge tables with low latency. It achieves this by internally using hash tables. It is modeled along the lines of Google's BigTable, which helps access data in the Google File System (GFS). HBase is scalable, supports failover when a node goes down, and is good with unstructured as well as semi-structured data. Hence, it is ideal for querying big data stores for analytical purposes.
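Row-level access of this kind can be sketched with the third-party happybase client, which talks to HBase through its Thrift gateway; the host, table name, column family, and row key below are placeholders, and the table is assumed to already exist.

    import happybase  # third-party: pip install happybase

    # Connect via the HBase Thrift gateway (host is a placeholder)
    connection = happybase.Connection("hbase.example.com")
    table = connection.table("sensor_readings")

    # Write one cell: row key, column family:qualifier, value
    table.put(b"sensor-1|reading-42", {b"data:temp": b"21.5"})

    # Low-latency read of a single row, no batch job required
    row = table.row(b"sensor-1|reading-42")
    print(row[b"data:temp"])  # b'21.5'

The single-row get is the point: it returns in milliseconds, where a scan of the same data through a batch job would take far longer.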