UNIT 4 Information Integration

Information integration (II) is the merging of information from heterogeneous sources with differing conceptual, contextual, and typographical representations. It is used in data mining and in the consolidation of data from unstructured or semi-structured resources. Typically, information integration refers to textual representations of knowledge but is sometimes applied to rich-media content. Information fusion, a related term, is the combination of information into a new set of information with the aim of reducing redundancy and uncertainty.[1]

Examples of technologies available to integrate information include deduplication and string metrics, which allow the detection of similar text in different data sources by fuzzy matching. A host of methods for these research areas are available, such as those presented by the International Society of Information Fusion. Other methods rely on causal estimates of the outcomes based on a model of the sources.

Information integration is an ongoing challenge in data management, and various approaches have been proposed in database research.

What Does Data Retrieval Mean?
In databases, data retrieval is the process of identifying and extracting data from a database, based on a query provided by the user or application.

Retrieving data using SQL queries
You can create SQL queries to search for messages and display their contents. For example, to select all messages with the status ERROR, issue the following query:

SELECT MWH_WMQI_MSG_ID, MWH_MSG_GRP FROM schema.DNIV_MWH_ou
WHERE MWH_MSG_STATUS = 'ERROR'

where schema represents the schema name of the runtime database, which is set by the DNIvSN placeholder.

Note: Querying the message warehouse can affect the overall performance of your FTM SWIFT instance. A poorly designed query might even lock out other FTM SWIFT services. Therefore, observe the following recommendations:
● For your queries, create an index or a small number of indexes on the base table of the message warehouse. Otherwise, even a simple query on the message warehouse forces DB2® to scan the entire table.
● If you search the LOB column, specify additional selection criteria to limit the scope of the query so that the LOB value is checked only for a small number of rows.
● Never search or retrieve the values in the LOB column for a large number of entries while FTM SWIFT is running.
Data retrieval in SQL refers to the process of obtaining or extracting specific data from a database using the SQL (Structured Query Language) programming language. It involves issuing queries to the database system to fetch information that matches certain conditions or criteria.

Using SQL commands such as SELECT, FROM, WHERE, JOIN, GROUP BY, HAVING, and ORDER BY, data retrieval allows users to specify what data they want to retrieve and under what conditions. The SELECT statement is commonly used to specify the columns to be retrieved, the FROM clause identifies the table(s) from which to retrieve data, and the WHERE clause filters the data based on specific conditions.

Data retrieval in SQL is essential for retrieving relevant and specific information from large datasets stored in databases. It enables users to extract and manipulate data according to their needs, perform calculations, apply filters, combine data from multiple tables, group data, and sort the results. SQL's data retrieval capabilities provide flexibility and efficiency in accessing and analyzing data stored in a database.
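As a small illustration of these clauses, the following sketch runs a retrieval query against an in-memory SQLite database using Python's built-in sqlite3 module; the orders table and its values are invented for the example.

import sqlite3

# Build a throwaway in-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Asha", "IN", 120.0), (2, "Ben", "US", 80.0), (3, "Asha", "IN", 40.0)],
)

# SELECT + WHERE + GROUP BY + ORDER BY: total order amount per customer, largest first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE country = 'IN'
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
print(rows)   # [('Asha', 160.0)]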
Big data processing is a set of techniques or programming models to access large-scale data to extract useful information for supporting and providing decisions.

Six stages of data processing

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as "pre-processing," is the stage at which raw data is cleaned up and organized for the following stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data input to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices, etc.) and its intended use (examining advertising patterns, medical diagnosis from connected devices, determining customer needs, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is translated, readable, and often presented in the form of graphs, videos, images, or plain text. Members of the company or institution can now begin to self-serve the data for their own data analytics projects.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. Properly stored data is also a necessity for compliance with data protection legislation such as GDPR. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.
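The stages above can be pictured as a small, linear flow. The sketch below walks toy records through collection, preparation (dropping duplicate and incomplete rows, as described in stage 2), processing, output, and storage; the record fields and file name are invented for illustration.

import json

# Stage 1 - collection: toy records as they might arrive from a source system.
raw = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},      # duplicate
    {"id": 2, "amount": None},      # incomplete
    {"id": 3, "amount": 25.5},
]

# Stage 2 - preparation: remove duplicates and records with missing values.
seen, clean = set(), []
for rec in raw:
    if rec["id"] in seen or rec["amount"] is None:
        continue
    seen.add(rec["id"])
    clean.append(rec)

# Stage 4 - processing: derive a simple summary figure.
total = sum(rec["amount"] for rec in clean)

# Stage 5 - output/interpretation: present the result in readable form.
print(f"{len(clean)} valid records, total amount {total:.2f}")

# Stage 6 - storage: persist the prepared data for later use.
with open("processed_records.json", "w") as f:
    json.dump(clean, f)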
The future of data processing
The future of data processing lies in the cloud. Cloud technology builds on the convenience of current electronic data processing methods and accelerates their speed and effectiveness. Faster, higher-quality data means more data for each organization to utilize and more valuable insights to extract.

As big data migrates to the cloud, companies are realizing huge benefits. Big data cloud technologies allow companies to combine all of their platforms into one easily adaptable system. As software changes and updates (as it often does in the world of big data), cloud technology seamlessly integrates the new with the old.

What Is a Big Data Pipeline?
Big data pipelines are like everyday data pipelines, helping move data from one point to another and transforming it before it reaches its destination. However, everyday data pipelines become unreliable when data systems experience sudden growth or a need to add more data sources arises; downtime or performance degradation can occur. Hence the need for big data pipelines.

A big data pipeline offers businesses the following advantages:
● Scalability: Big data pipelines can serve large and growing data sets, which is crucial for growth, as their architecture automatically scales to accommodate changing business needs.
● Flexibility: Health, finance, banking, construction, and other large organizations today rely on many data sources, and these sources rarely produce data of a single type. Big data pipelines can ingest and process structured, unstructured, and semi-structured data, and process this data in streams, batches, or other methods.
● Reliable architecture: Most big data pipelines use cloud technology to build a highly tolerant, distributed architecture that ensures continuous availability and mitigates the impact of any failure along the pipeline.
● Timely data processing: These pipelines offer real-time processing, allowing businesses to ingest and extract insights quickly for timely decision-making.

Components of a Big Data Pipeline
A big data pipeline typically involves several components, each responsible for specific tasks within the data processing flow (a minimal sketch of such a flow follows this list):
1. Data sources are the data origins, where data lives, and the primary extraction point. Popular data sources include streaming devices like IoT and medical wearables, APIs, databases like CRMs, and files. These data sources produce data of numerous types, like CSV, JSON, or XML.
2. Data ingestion/extraction: This component extracts the data from sources using an ingestion or ETL tool or a data integration platform. Your choice of ingestion tool depends on the data sources and the type of data generated by the sources.
3. Data storage: After extraction, data is transported to a central storage repository before use in analysis, ML, or data science cases. However, these storage locations may vary depending on the business use case. For example, data may be stored in an intermediate storage/Operational Data Store (ODS) to serve business transactional purposes by providing a current and updated data state for more accurate business reporting. Furthermore, you can store the data in target storage locations like data warehouses or lakes to serve other business intelligence and analytics purposes.
4. Data processing is where raw data is transformed into a high-quality, clean, and robust dataset. This component performs numerous data tasks and may involve constructive (adding or replicating data), destructive (removing null values, outliers, or duplicated data), or structural (renaming fields, combining columns) data transformation tasks to prepare data for analytics.
5. Data analysis and visualization: This component uses different methods, such as statistical techniques and ML algorithms, to identify patterns, relationships, and trends existing within your data, and communicates the results in digestible, accessible, and readable formats like graphs, charts, or models.
6. Workflow: Workflow defines how every step in your pipeline should proceed. Orchestrators manage the order of operations, handle failures, and do scheduling.
7. Monitoring: Monitoring ensures the health, performance, and success of your pipeline. Automated alerts to pipeline administrators when errors occur are important.
8. Destination: Your pipeline destination may be a data store, data warehouse, data lake, or BI application, depending on the end-use case for the data.
The specific components you'll use can vary based on the unique requirements of the project, the existing technical infrastructure, the data types and sources, and the specific objectives of the pipeline.
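As referenced above, the sketch below strings a few of these components together as plain Python functions (ingest, process, load); the records, field names, and in-memory "warehouse" are purely illustrative stand-ins.

# Minimal ingest -> process -> load flow; names and data are illustrative only.
def ingest():
    # Component 2: pull raw records from a source (hard-coded here for the sketch).
    return [{"device": "sensor-1", "temp_c": 21.4}, {"device": "sensor-2", "temp_c": None}]

def process(records):
    # Component 4: destructive transform - drop records with missing readings;
    # structural transform - rename the field.
    return [{"device": r["device"], "temperature": r["temp_c"]}
            for r in records if r["temp_c"] is not None]

def load(records, destination):
    # Component 8: write to the destination (an in-memory list standing in for a warehouse).
    destination.extend(records)

warehouse = []
load(process(ingest()), warehouse)
print(warehouse)   # [{'device': 'sensor-1', 'temperature': 21.4}]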
Big Data Pipeline Architecture
Depending on your business needs, your big data pipeline architecture may be any of the following:

Streaming Data Architecture
Streaming architecture serves businesses requiring ultra-low latency for their transactions. Streaming architecture pipelines process data in real time, allowing companies to act on insights before they lose value. Financial, health, manufacturing, and IoT device data rely on streaming big data pipelines to improve customer experiences via segmentation, predictive maintenance, and monitoring.

Batch Architecture
Unlike streaming architecture, batch architecture extracts and processes data on defined intervals or on a trigger. This is best for workloads/use cases with no need for immediate data analysis, such as payroll processing, or for e-commerce businesses handling inventory at intervals.

Change Data Capture
CDC is employed in streaming architecture. It helps keep systems in sync while conserving network and compute resources: every new ingestion loads only the data that has changed since the last ingestion, instead of loading the entire data set.
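A minimal sketch of the CDC idea, using an invented in-memory "source table" and a watermark column named updated_at; real CDC tools typically read a database change log instead, so this only shows the incremental-load pattern.

from datetime import datetime

# Hypothetical source rows with an updated_at watermark column.
source_table = [
    {"id": 1, "status": "NEW",     "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "status": "SHIPPED", "updated_at": datetime(2024, 1, 2, 8, 30)},
]

last_sync = datetime(2024, 1, 1, 12, 0)   # watermark from the previous ingestion

def incremental_load(table, watermark):
    # Only rows changed since the last ingestion are loaded, not the full table.
    changed = [row for row in table if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return changed, new_watermark

changed_rows, last_sync = incremental_load(source_table, last_sync)
print(changed_rows)   # only the row with id=2 is newer than the watermark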
Lambda Architecture
Lambda architecture is a hybrid method that combines streaming and batch processing. However, pipeline management becomes very complex because this architecture uses two separate layers for streaming and batch processing.
Key Considerations When Building Big Data Pipelines
Building your big data pipeline involves answering multiple questions regarding data quality, safety, governance, errors, and relevance for business use, which makes pipeline design challenging. Some factors to consider include:
● Adaptability and scalability: Imagine a business that provides financial trading information to users via an application. What happens if such an application suddenly goes viral? Will your pipeline be able to handle all the new users? What if you need to add new features? Your pipeline's adaptability and easy scalability ensure that systems remain operational despite increased demand.
● Broad compatibility with data sources and integration tools: As data sources increase, integrating new sources may pose risks, causing errors that affect pipeline performance downstream. Selecting tools with broad compatibility that integrate seamlessly and can handle a wide range of data sources is vital.
● Performance: Performance is greatly dependent on latency, and depending on your business goals, the need for timeliness of data delivery differs. For example, manufacturing, health, finance, and other industries that rely on immediate, real-time insights require ultra-low latency to make sure operations proceed smoothly. Ensuring there are no delays in extracting data from the source is vital to maximizing the performance of your pipelines.
● Data quality: Your data product and the quality of your analytics are only as good as the quality of your data, and numerous data challenges, like the presence of bad data, outliers, and duplicated data, affect your pipeline. Employing mechanisms to catch these, data drift, and other issues that challenge the quality of your data preserves the effectiveness and accuracy of analytics results.
● Security and governance: Your pipeline needs to be secure, keeping malicious attackers at bay while providing access to authorized individuals. Preventive measures like access and privacy controls, encryption of data in transit and at rest, tracking, and audit trails help secure and track access history for your business data.
● Cost optimization: Building and maintaining data pipelines is a continuous process, requiring updates and reconfiguration to improve efficiency further, especially as business data changes and volumes increase.
● Data drift: Data drift breaks your pipeline and often reduces the quality of analytics or the predictive accuracy of your ML models. Data drift results from unexpected changes in the structure or semantics of your data. Smart data pipelines offer a way to mitigate this, as they detect any schema/data structure changes and alert your team if any new data violates your configured rules (see the sketch after this list).
● Reliability: Reliability ensures your pipelines proceed as expected, without errors or interruptions. Modern data pipelines employ a distributed architecture that redistributes loads in failover cases to ensure continuous availability.
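A minimal sketch of the schema-drift check described in the Data drift item above: incoming records are compared against an expected schema, and any mismatch triggers an alert. The expected fields and the alert function are assumptions for illustration.

EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}   # assumed contract

def alert(message):
    # Stand-in for a real alerting channel (email, pager, monitoring tool).
    print("ALERT:", message)

def check_schema(record):
    # Flag missing fields, unexpected new fields, and type changes.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            alert(f"missing field '{field}'")
        elif not isinstance(record[field], expected_type):
            alert(f"field '{field}' changed type to {type(record[field]).__name__}")
    for field in record:
        if field not in EXPECTED_SCHEMA:
            alert(f"unexpected new field '{field}'")

check_schema({"order_id": 7, "amount": "19.99", "currency": "EUR", "channel": "web"})
# -> alerts on the amount type change and the new 'channel' field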
Analytical operations in big data
Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes use familiar statistical analysis techniques, like clustering and regression, and apply them to more extensive datasets with the help of newer tools.

What is big data analytics?
Big data analytics is the often complex process of examining big data to uncover information, such as hidden patterns, correlations, market trends, and customer preferences, that can help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques give organizations a way to analyze data sets and gather new information. Business intelligence (BI) queries answer basic questions about business operations and performance.
Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analysis powered by analytics systems.
An example of big data analytics can be found in the healthcare industry, where millions of patient records, medical claims, clinical results, care management records, and other data must be collected, aggregated, processed, and analyzed.
Big data analytics is used for accounting, decision-making, predictive analytics, and many other purposes. The data varies greatly in type, quality, and accessibility, presenting significant challenges but also offering tremendous benefits.
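To make "clustering and regression applied with newer tools" concrete, here is a small, self-contained example using scikit-learn on invented numbers; it only illustrates the two techniques named above, not any particular production setup.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Clustering: group customers by (age, yearly spend) into two segments.
customers = np.array([[22, 300], [25, 350], [47, 1200], [52, 1500]])
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(segments)                # two clear groups, e.g. [1 1 0 0]

# Regression: fit spend as a linear function of age and predict for a new customer.
ages = customers[:, [0]]
spend = customers[:, 1]
model = LinearRegression().fit(ages, spend)
print(model.predict([[35]]))   # rough spend estimate for a 35-year-old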
Big Data Analytics Technologies and Tools
Big Data Analytics relies on various technologies and tools that might sound complex; let's simplify them:
● Hadoop: Imagine Hadoop as an enormous digital warehouse. It's used by companies like Amazon to store tons of data efficiently. For instance, when Amazon suggests products you might like, it's because Hadoop helps manage your shopping history.
● Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly analyze what you watch and recommend your next binge-worthy show.
● NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing cabinets that Airbnb uses to store your booking details and user data. These databases are popular because they are quick and flexible, so the platform can provide you with the right information when you need it.
● Tableau: Tableau is like an artist that turns data into beautiful pictures. The World Bank uses it to create interactive charts and graphs that help people understand complex economic data.
● Python and R: Python and R are like magic tools for data scientists. They use these languages to solve tricky problems. For example, Kaggle uses them to predict things like house prices based on past data.
● Machine Learning Frameworks (e.g., TensorFlow): Machine learning frameworks are the tools that make predictions. Airbnb uses TensorFlow to predict which properties are most likely to be booked in certain areas. It helps hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics. They help organizations gather, process, understand, and visualize data, making it easier for them to make decisions based on information.

Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages; let's understand them with examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps them make smart choices about what products to stock. This not only reduces waste but also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is what makes those product suggestions so accurate. It's like having a personal shopper who knows your taste and helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data Analytics to catch and stop fraudulent transactions. It's like having a guardian that watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver your packages faster and with less impact on the environment. It's like taking the fastest route to your destination while also being kind to the planet.
Aggregation operations in big data
Data aggregation is the process of collecting data to present it in summary form. This information is then used to conduct statistical analysis and can also help company executives make more informed decisions about marketing strategies, price settings, and structuring operations, among other things.
Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format to perform statistical analysis of business schemes or analysis of human patterns. When numerous data is collected from various datasets, it is crucial to gather accurate data to provide significant results. Data aggregation can help in taking prudent decisions in marketing, finance, pricing the product, etc.

How does data aggregation work?
Data aggregation is needed when a dataset as a whole is useless information and cannot be used for analysis. So the datasets are summarized into useful aggregates to acquire desirable results and to enhance the user experience or the application itself. Aggregates provide measurements such as sum, count, and average (a small pandas sketch of these measurements appears at the end of this section). Summarized data helps in the demographic study of customers and their behavior patterns. Aggregated data helps in finding useful information about a group after it is written up as reports. It also helps in data lineage, to understand, record, and visualize data, which in turn helps in tracing the root cause of errors in data analytics. There is no specific need for an aggregated element to be a number; we can also find the count of non-numeric data. Aggregation must be done for a group of data and not based on individual data.

Examples of aggregate data:
● Finding the average age of customers buying a particular product, which can help in finding the targeted age group for that particular product. Instead of dealing with an individual customer, the average age of the customers is calculated.
● Finding the number of consumers by country. This can increase sales in the country with more buyers and help the company enhance its marketing in a country with fewer buyers. Here also, instead of an individual buyer, a group of buyers in a country is considered.
● By collecting data from online buyers, the company can analyze consumer behavior patterns and the success of the product, which helps the marketing and finance departments find new marketing strategies and plan the budget.
● Finding the value of voter turnout in a state or country. It is done by counting the total votes for a candidate in a particular region instead of counting the individual voter records.

Data aggregators:
Data aggregators are systems in data mining that collect data from numerous sources, then process the data and repackage it into useful data packages. They play a major role in improving customer data by acting as an agent. They help in the query and delivery process, where the customer requests data instances about a certain product; the aggregators provide the customer with matched records of the product, and the customer can then buy any instances of the matched records.

Working of data aggregators:
The working of data aggregators takes place in three steps:
● Collection of data: Collecting data from different datasets in the enormous database. The data can be extracted using IoT (Internet of Things) sources such as:
● Communications in social media
● Speech recognition, as in call centers
● News headlines
● Browsing history and other personal data from devices
● Processing of data: After collecting the data, the data aggregator finds the atomic data and aggregates it. In the processing step, aggregators use various algorithms from the fields of Artificial Intelligence and Machine Learning, and also incorporate statistical methods such as predictive analysis. In this way, various useful insights can be extracted from raw data.
● Presentation of data: After the processing step, the data is in a summarized format that can provide a desirable statistical result with detailed and accurate data.

Choice of manual or automated data aggregators:
Data aggregation can also be done manually. When starting a new company, one can opt for manual aggregation by using Excel sheets and by creating charts to manage performance, budget, marketing, etc. Data aggregation in a well-established company calls for middleware, third-party software that implements the aggregation automatically using marketing tools. When large datasets are encountered, a data aggregator system is needed to provide accurate results.

Types of data aggregation:
● Time aggregation: Provides the data points for a single resource over a defined time period.
● Spatial aggregation: Provides the data points for a group of resources over a defined time period.

Time intervals for the data aggregation process:
● Reporting period: The period over which the data is collected for presentation. It can consist of either aggregated data points or simply raw data. E.g., if data is collected and processed into a summarized format over a period of one day from a network device, the reporting period is one day.
● Granularity: The period over which data points are collected for aggregation. E.g., to find the sum of data points for a specific resource collected over a period of 10 minutes, the granularity is 10 minutes. The value of granularity can vary from a minute to a month, depending upon the reporting period.
● Polling period: The frequency at which resources are sampled for data. E.g., if a group of resources is polled every 7 minutes, data points for each resource are generated every 7 minutes. Polling period and granularity come under spatial aggregation.

(Figure: Workflow of Data Analysis in SaaS Applications)

Applications of data aggregation:
● Data aggregation is used in many fields where a large number of datasets are involved. It helps in making fruitful decisions in marketing or finance management, and it helps in the planning and pricing of products.
● Efficient use of data aggregation can help in the creation of marketing schemes. E.g., if a company is running ad campaigns on a particular platform, it must deeply analyze the data to raise sales. Aggregation can help in analyzing the execution of campaigns over a respective time period, or for a particular cohort or a particular channel/platform. This can be done in three steps: Extraction, Transform, Visualize.
● Data aggregation plays a major role in the retail and e-commerce industries through monitoring of competitive prices. In this field, keeping track of fellow companies is a must: a company should collect details of pricing, offers, etc. of other companies to know what its competitors are up to. This can be done by aggregating data from a single resource, such as a competitor's website.
● Data aggregation plays an impactful role in the travel industry. It comprises research about competitors, gaining marketing intelligence to reach people, and image capture from travel websites. It also includes customer sentiment analysis, which helps to find emotions and satisfaction based on linguistic analysis. Failed data aggregation in this field can lead to declining growth for a travel company.
● For business analysis purposes, data can be aggregated into summary formats, which can help the head of the firm take correct decisions to satisfy customers.
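A small sketch of the aggregate measurements mentioned above (average age per product and consumer counts per country), using pandas group-by aggregation on invented rows.

import pandas as pd

# Invented purchase records.
df = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone", "phone"],
    "country": ["IN", "US", "IN", "IN", "DE"],
    "age":     [34, 41, 22, 27, 30],
})

# Average age of buyers per product - a group-level figure, not an individual record.
print(df.groupby("product")["age"].mean())

# Number of consumers by country.
print(df.groupby("country")["age"].count())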

High level operations in big data
Here is the list of the top 14 industries using big data applications:
1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale trade
9. Transportation
10. Energy and Utilities
11. Big Data & Auto Driving Car
12. Big Data in IoT
13. Big Data in Marketing
14. Big Data in Business Insights
For an explanation, refer to: https://www.simplilearn.com/tutorials/big-data-tutorial/big-data-applications
Tools and systems in big data
There are hundreds of data analytics tools in the market today, but the selection of the right tool will depend upon your business NEED, GOALS, and VARIETY to take your business in the right direction. Now, let's check out the top 10 analytics tools in big data.

1. APACHE Hadoop
It's a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows the system to process data efficiently and lets the data run in parallel. It can process both structured and unstructured data from one server across multiple computers. Hadoop also offers cross-platform support for its users. Today, it is one of the most widely used big data analytics tools and is popular with many tech giants such as Amazon, Microsoft, IBM, etc.
Features of Apache Hadoop:
● Free to use and offers an efficient storage solution for businesses.
● Offers quick access via HDFS (Hadoop Distributed File System).
● Highly flexible and can be easily implemented with MySQL and JSON.
● Highly scalable, as it can distribute a large amount of data in small segments.
● It works on small commodity hardware like JBOD or a bunch of disks.

2. Cassandra
APACHE Cassandra is an open-source NoSQL distributed database that is used to fetch large amounts of data. It's one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of resources with almost zero downtime. It was created by Facebook in 2008 and released publicly.
Features of APACHE Cassandra:
● Data Storage Flexibility: It supports all forms of data, i.e., structured, unstructured, and semi-structured, and allows users to change them as per their needs.
● Data Distribution System: Easy to distribute data by replicating it across multiple data centers.
● Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast storage and data processing.
● Fault tolerance: The moment any node fails, it is replaced without any delay.

3. Qubole
It's an open-source big data tool that helps in fetching data in a value chain using ad-hoc analysis in machine learning. Qubole is a data lake platform that offers end-to-end service, reducing the time and effort required to move data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also helps in lowering the cost of cloud computing by 50%.
Features of Qubole:
● Supports the ETL process: It allows companies to migrate data from multiple sources into one place.
● Real-time insight: It monitors users' systems and allows them to view real-time insights.
● Predictive analysis: Qubole offers predictive analysis so that companies can take action accordingly to target more acquisitions.
● Advanced security system: To protect users' data in the cloud, Qubole uses an advanced security system and ensures protection against future breaches. Besides, it also allows encrypting cloud data against any potential threat.

4. Xplenty
It is a data analytics tool for building a data pipeline using minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone, and virtual meetings. Xplenty is a platform to process data for analytics over the cloud and segregates all the data together.
Features of Xplenty:
● Rest API: A user can possibly do anything by implementing the Rest API.
● Flexibility: Data can be sent and pulled to databases, warehouses, and Salesforce.
● Data security: It offers SSL/TLS encryption, and the platform is capable of verifying algorithms and certificates regularly.
● Deployment: It offers integration apps for both cloud and in-house use and supports deployment to integrate apps over the cloud.

5. Spark
APACHE Spark is another framework that is used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs that provide easy data-pulling methods, and it is capable of handling multi-petabytes of data as well. Spark set a record by processing 100 terabytes of data in just 23 minutes, which broke the previous world record held by Hadoop (71 minutes). This is a reason why big tech giants are moving towards Spark now, and it is highly suitable for ML and AI today.
Features of APACHE Spark:
● Ease of use: It allows users to run it in their preferred language (Java, Python, etc.).
● Real-time processing: Spark can handle real-time streaming via Spark Streaming.
● Flexible: It can run on Mesos, Kubernetes, or in the cloud.
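As a small taste of Spark's easy-to-use API, the following PySpark sketch counts words in a local list; the input lines are invented, and the local[*] master is only for trying it out on one machine (assuming the pyspark package is installed).

from pyspark.sql import SparkSession

# Local Spark session for experimentation; a real cluster would use a cluster master URL.
spark = SparkSession.builder.master("local[*]").appName("word-count-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data tools", "big data pipelines", "spark streaming"])

counts = (lines.flatMap(lambda line: line.split())        # split lines into words
               .map(lambda word: (word, 1))               # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))          # sum counts per word in parallel

print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('tools', 1), ...]
spark.stop()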
6. Mongo DB
MongoDB came into the limelight in 2010. It is a free, open-source, document-oriented (NoSQL) database that is used to store a high volume of data. It uses collections and documents for storage, and a document consists of key-value pairs, which are considered the basic unit of MongoDB. It is popular among developers due to its availability for multiple programming languages such as Python, JavaScript, and Ruby.
Features of Mongo DB:
● Written in C++: It's a schema-less DB and can hold a variety of documents inside.
● Simplifies the stack: With the help of Mongo, a user can easily store files without any disturbance in the stack.
● Master-slave replication: It can write/read data from the master, which can be called back for backup.
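A minimal sketch of storing and querying key-value documents in a collection with the official Python driver (pymongo); the connection string, database, and document fields are placeholders, assuming a locally running MongoDB instance.

from pymongo import MongoClient

# Placeholder connection string for a locally running MongoDB instance.
client = MongoClient("mongodb://localhost:27017/")
db = client["bookings_demo"]            # hypothetical database
listings = db["listings"]               # hypothetical collection

# Documents are just key-value pairs; no fixed schema is required.
listings.insert_one({"city": "Lisbon", "price": 80, "amenities": ["wifi", "kitchen"]})
listings.insert_one({"city": "Lisbon", "price": 120})   # a differently shaped document is fine

# Query documents matching a condition.
for doc in listings.find({"city": "Lisbon", "price": {"$lt": 100}}):
    print(doc)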
7. Apache Storm
Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming language barrier and can support any of them. It was designed to handle pools of large data in fault-tolerant and horizontally scalable ways. When we talk about real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, due to which many tech giants use APACHE Storm in their systems. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
Features of Storm:
● Data processing: Storm processes the data even if a node gets disconnected.
● Highly scalable: It keeps up its performance even as the load increases.
● Fast: The speed of APACHE Storm is impeccable; it can process up to 1 million messages of 100 bytes on a single node.

8. SAS
Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analytical System, or SAS, allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to get a strong grip on AI and ML, they have introduced new tools and products.
Features of SAS:
● Flexible programming language: It offers easy-to-learn syntax and vast libraries, which make it suitable for non-programmers.
● Vast data format support: It provides support for many programming languages, including SQL, and carries the ability to read data from any format.
● Encryption: It provides end-to-end security with a feature called SAS/SECURE.

9. Datapine
Datapine is an analytical tool for BI that was founded in 2012 (Berlin, Germany). In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small and medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in 4 different price brackets, starting from $249 per month, with dashboards offered by function, industry, and platform.
Features of Datapine:
● Automation: To cut down the manual chase, Datapine offers a wide array of AI assistant and BI tools.
● Predictive tool: Datapine provides forecasting/predictive analytics; using historical and current data, it derives future outcomes.
● Add-ons: It also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.
10. Rapid Miner
It's a fully automated visual workflow design tool used for data analytics. It's a no-code platform, and users aren't required to code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, research, etc. Though it's an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of Rapid Miner, one can easily deploy ML models to the web or mobile (only when the user interface is ready to collect real-time figures).
Features of Rapid Miner:
● Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL.
● Storage: Users can access cloud storage facilities such as AWS and Dropbox.
● Data validation: Rapid Miner enables the visual display of multiple results in history for better evaluation.

Big data workflow management: refer to https://www.naukri.com/code360/library/understanding-big-data-workflows
Hadoop
Hadoop is an open-source software programming framework for storing a large amount of data and performing the computation. Its framework is based on Java programming, with some native code in C and shell scripts. Hadoop is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.

Hadoop has two main components:
1) HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.
2) YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.

Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.

Advantages:
1) Ability to store a large amount of data.
2) High flexibility.
3) Cost effective.
4) High computational power.
5) Tasks are independent.
6) Linear scaling.

Disadvantages:
1) Not very effective for small data.
2) Hard cluster management.
3) Has stability issues.
4) Security concerns.

A Hadoop cluster is a collection of commodity hardware (devices that are inexpensive and amply available). These hardware components work together as a single unit. In a Hadoop cluster there are many nodes (computers and servers) containing masters and slaves: the Name Node and Resource Manager work as masters, and the Data Node and Node Manager work as slaves. The purpose of the master nodes is to guide the slave nodes in a single Hadoop cluster.

Types of Hadoop clusters:
1. Single Node Hadoop Cluster: As the name suggests, the cluster consists of only a single node, which means all the Hadoop daemons, i.e., Name Node, Data Node, Secondary Name Node, Resource Manager, and Node Manager, run on the same machine.
2. Multiple Node Hadoop Cluster: As the name suggests, it contains multiple nodes. In this kind of cluster setup, the Hadoop daemons are spread across different nodes in the same cluster.

MapReduce
MapReduce is a programming model and associated implementation for processing and generating large datasets in parallel across a distributed cluster of computers. The core idea behind MapReduce is to divide the data processing task into smaller sub-tasks that can be executed in parallel across multiple nodes, and then aggregate the results to produce the final output. (A small pure-Python sketch of this model appears at the end of this section.)

Key Components of MapReduce:

1. Mapper: The Mapper is responsible for processing the input data and generating intermediate key-value pairs. It applies a user-defined function (map function) to each input record and emits intermediate key-value pairs.

2. Reducer: The Reducer receives the intermediate key-value pairs generated by the Mapper and performs a user-defined aggregation operation on these pairs. It combines the values associated with the same intermediate key and produces the final output.

3. Partitioner: The Partitioner determines which Reducer instance will receive each intermediate key-value pair. It ensures that all key-value pairs with the same key are processed by the same Reducer, enabling efficient aggregation.

4. InputSplit: The InputSplit represents a chunk of input data that is processed by a single Mapper. It partitions the input dataset into manageable chunks, which are processed in parallel by different Mapper instances.

5. OutputFormat: The OutputFormat specifies the format of the final output produced by the Reducer. It defines how the output key-value pairs are serialized and written to the output storage.

Workflow of MapReduce:

1. Input Data Distribution: The input data is partitioned into smaller chunks, known as InputSplits, which are distributed across the cluster.

2. Map Phase: Each Mapper processes its assigned InputSplit independently. It applies the map function to each input record and emits intermediate key-value pairs.

3. Shuffle and Sort Phase: The intermediate key-value pairs generated by the Mappers are shuffled and sorted based on their keys. This phase ensures that all values associated with the same key are grouped together and ready for aggregation by the Reducers.

4. Reduce Phase: Each Reducer receives a subset of intermediate key-value pairs. It applies the reduce function to aggregate the values associated with each key and produce the final output.

Significance of MapReduce in Big Data Analytics:

1. Scalability: MapReduce enables the processing of massive datasets by distributing the workload across a large number of nodes in a cluster. It can seamlessly scale to handle petabytes of data without sacrificing performance.

2. Fault Tolerance: MapReduce frameworks like Hadoop provide built-in fault tolerance mechanisms, ensuring that data processing tasks continue to execute even in the presence of node failures or network issues.

3. Parallel Processing: MapReduce leverages parallel processing to speed up data processing tasks by executing multiple tasks concurrently across distributed nodes. This parallelism enables significant reductions in processing time for large-scale data analytics.

4. Flexibility: MapReduce offers flexibility in programming by allowing developers to express complex data processing tasks using simple map and reduce functions. It supports a wide range of data processing workflows, including ETL (Extract, Transform, Load), data aggregation, and machine learning algorithms.

Applications of MapReduce:

1. Data Warehousing: MapReduce is widely used for processing and analyzing large volumes of structured and unstructured data in data warehousing applications. It facilitates tasks such as data cleaning, transformation, and aggregation to prepare data for analytical queries.

2. Log Processing: MapReduce is employed in log processing applications to analyze web server logs, system logs, and application logs for insights into user behavior, system performance, and security monitoring.

3. Sentiment Analysis: MapReduce is utilized in sentiment analysis applications to analyze text data from social media, customer reviews, and other sources to extract sentiment polarity and identify trends and patterns.

4. Machine Learning: MapReduce frameworks like Apache Spark MLlib leverage distributed computing to train machine learning models on large datasets. This enables scalable model training and prediction across distributed clusters.

Conclusion:
MapReduce has emerged as a foundational paradigm in Big Data analytics, offering scalability, fault tolerance, and parallel processing capabilities for processing and analyzing massive datasets. By understanding the principles and components of MapReduce, organizations can harness its power to derive valuable insights from their data across various domains and applications. As the volume of data continues to grow exponentially, MapReduce remains a crucial tool in the arsenal of Big Data analytics technologies.
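To make the map, shuffle/sort, and reduce phases concrete, here is a small pure-Python word-count sketch. It imitates the phases in a single process, so it shows the programming model only, not a distributed Hadoop job.

from collections import defaultdict

def map_phase(record):
    # Mapper: emit an intermediate (word, 1) pair for each word in the input record.
    return [(word, 1) for word in record.split()]

def reduce_phase(key, values):
    # Reducer: aggregate all values that share the same key.
    return key, sum(values)

input_splits = ["big data big pipelines", "map reduce big data"]

# Map phase over every split.
intermediate = [pair for split in input_splits for pair in map_phase(split)]

# Shuffle and sort: group values by key (a partitioner would route keys to reducers here).
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce phase per key.
results = [reduce_phase(key, values) for key, values in groups.items()]
print(results)   # [('big', 3), ('data', 2), ('map', 1), ('pipelines', 1), ('reduce', 1)]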
Hive Query Language (HiveQL) is a query language in Apache Hive for processing and analyzing structured data. It separates users from the complexity of MapReduce programming. It reuses common concepts from relational databases, such as tables, rows, columns, and schema, to ease learning. Hive provides a CLI for writing Hive queries using HiveQL.

Types of built-in operators in HiveQL:
1) Relational operators: Used for relationship comparisons between two operands, such as equals, not equals, less than, greater than, etc. The operand types are all number types for these operators.
2) Arithmetic operators: Used for performing arithmetic operations on operands, such as addition, subtraction, multiplication, and division. The operand types are all number types for these operators.
3) Logical operators: Used for performing logical operations on operands, such as AND, OR, and NOT. The operand types are all BOOLEAN for these operators.
4) Operators on complex types.
5) Complex type constructors.

Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop and works on data residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which get internally converted to MapReduce jobs. Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).

Features of Hive:
1) Hive is fast and scalable.
2) It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
3) It is capable of analyzing large datasets stored in HDFS.
4) It allows different storage types such as plain text, RCFile, and HBase.
5) It uses indexing to accelerate queries.

Limitations of Hive:
1) Hive is not capable of handling real-time data.
2) It is not designed for online transaction processing.
3) Hive queries have high latency.

Benefits of Hive:
1) Easy to use: Hive in Big Data is an easy-to-use software application that lets one analyze large-scale data through the batch processing technique. An efficient program, it uses HiveQL, a language that is very similar to SQL, the structured query language used for interaction with databases.
2) Fast experience: The technique of batch processing refers to the analysis of data in bits and parts that are later clubbed together. The analyzed data is sent to Apache Hadoop, while the schemas or derived stereotypes remain with Apache Hive.
3) Cheaper option: Another reason why Apache Hive is beneficial is that it is a comparatively cheaper option. For large organizations, profit is the key, yet with technologically advanced tools and software that are expensive to operate, profit margins can stoop low.

Architecture of Hive:
1) User Interface (UI): As the name describes, the user interface provides an interface between the user and Hive. It enables users to submit queries and other operations to the system. The Hive web UI, the Hive command line, and Hive HD Insight (on Windows Server) are supported by the user interface.
2) Hive Server: It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.
3) Driver: Queries from the user interface are received by the driver within Hive. The concept of session handles is implemented by the driver, and execute and fetch APIs modelled on JDBC/ODBC interfaces are provided to the user.
4) Compiler: Parsing of queries and semantic analysis of the different query blocks and query expressions are done by the compiler. The execution plan is eventually generated by the compiler with the help of the table in the database and the partition metadata obtained from the metastore.
5) Metastore: All the structured information about the different tables and partitions in the warehouse, including attributes and attribute-level information, is stored in the metastore, along with the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored. Hive selects corresponding database servers to store the schema or metadata of databases, tables, attributes in a table, data types of databases, and HDFS mapping.
6) Execution Engine: Execution of the execution plan made by the compiler is performed by the execution engine. The plan is a DAG of stages. The dependencies between the various stages of the plan are managed by the execution engine, and it executes these stages on the suitable system components.
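For a feel of HiveQL itself, here is a small sketch that submits an HQL query from Python, assuming the PyHive client library is installed and a HiveServer2 instance is reachable; the host, database, and table names are placeholders.

from pyhive import hive   # assumption: PyHive is installed and HiveServer2 is running

# Placeholder connection details for a HiveServer2 endpoint.
conn = hive.Connection(host="hive-server.example.com", port=10000, database="sales_demo")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive turns this query into MapReduce/Spark jobs internally.
cursor.execute("""
    SELECT country, COUNT(*) AS orders
    FROM orders
    WHERE amount > 100
    GROUP BY country
""")

for country, order_count in cursor.fetchall():
    print(country, order_count)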

What is RDBMS?

RDBMS stands for Relational Database Management System. It is basically a program that allows us to create, delete, and update a relational database. A relational database is a database system that stores and retrieves data in a tabular format organized in the form of rows and columns. It is a smaller subset of DBMS and was designed by E.F. Codd in the 1970s. The major DBMSs like MySQL, Oracle, and SQL Server are all based on the principles of the relational DBMS.

Characteristics of RDBMS:
1) Data must be stored in tabular form in the DB file, that is, it should be organized in the form of rows and columns.
2) Each row of a table is called a record/tuple. The collection of such records is known as the cardinality of the table.
3) Each column of the table is called an attribute/field. The collection of such columns is called the arity of the table.
4) No two records of a DB table can be the same. Data duplicity is therefore avoided by using a candidate key. A candidate key is a minimum set of attributes required to identify each record uniquely.
5) Tables are related to each other with the help of foreign keys.

Advantages of RDBMS:
1) Easy to manage: Each table can be independently manipulated without affecting the others.
2) Security: It is more secure, consisting of multiple levels of security, and access to shared data can be limited.
3) Flexible: Updating of data can be done at a single point without making amendments in multiple files. Databases can easily be extended to incorporate more records, thus providing greater scalability. It also facilitates easy application of SQL queries.
4) Users: RDBMS supports client-side architecture, storing multiple users together.
5) Facilitates storage and retrieval of large amounts of data.

Disadvantages of RDBMS:
1) High cost and extensive hardware and software support: Huge costs and setups are required to make these systems functional.
2) Scalability: In case of the addition of more data, servers along with additional power and memory are required.
3) Complexity: Voluminous data creates complexity in understanding the relations and may lower the performance.
4) Structured limits: The fields or columns of a relational database system are enclosed within various limits, which may lead to loss of data.
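A tiny illustration of these characteristics (rows and columns, a primary key keeping records unique, and a foreign key relating two tables), again using Python's built-in sqlite3 with invented tables.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce the foreign-key relationship

# Each table stores data as rows and columns; the primary key keeps records unique.
conn.execute("CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES departments(dept_id)   -- foreign key relating the tables
    )
""")

conn.execute("INSERT INTO departments VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employees VALUES (10, 'Asha', 1)")

# Join the related tables through the foreign key.
print(conn.execute("""
    SELECT e.name, d.name FROM employees e JOIN departments d ON e.dept_id = d.dept_id
""").fetchall())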

NoSQL databases, also known as "Not Only SQL" databases, are a class of database management systems that differ from traditional relational databases in their data model, scalability, and flexibility. NoSQL databases are designed to handle large volumes of unstructured or semi-structured data and to provide high availability and scalability in distributed environments. In this comprehensive overview, we'll delve into the details of NoSQL databases, covering their types, characteristics, advantages, and use cases.

Types of NoSQL Databases:

1. Document-Based Databases:
- Document-based databases store data in flexible, semi-structured formats such as JSON or BSON documents.
- Each document contains key-value pairs or key-array pairs, allowing for nested structures and complex data models.
- Examples: MongoDB, Couchbase, CouchDB.

2. Key-Value Stores:
- Key-value stores are the simplest form of NoSQL databases, storing data as a collection of key-value pairs.
- They provide fast access to data based on keys but offer limited query capabilities compared to other NoSQL databases.
- Examples: Redis, Amazon DynamoDB, Riak.

3. Column-Family Stores:
- Column-family stores organize data into columns grouped by column families.
- Each row can have a different set of columns, and columns are stored together, allowing for efficient read and write operations.
- Examples: Apache Cassandra, HBase, ScyllaDB.

4. Graph Databases:
- Graph databases represent data as nodes, edges, and properties, allowing for the efficient storage and querying of interconnected data.
- They excel in managing highly interconnected data such as social networks, recommendation engines, and network topologies.
- Examples: Neo4j, Amazon Neptune, JanusGraph.
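As a quick contrast with the document-store example shown earlier, the sketch below uses a key-value store (Redis, via the redis-py client) for fast access by key; the connection settings and keys are placeholders, assuming a locally running Redis server.

import redis   # assumption: the redis-py package is installed and a Redis server is running

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and fetch values by key - fast lookups, but no rich query language.
r.set("session:42", "user_id=7;cart=3")
r.set("page:home:hits", 0)
r.incr("page:home:hits")                 # atomic counter, a common key-value pattern

print(r.get("session:42"))
print(r.get("page:home:hits"))           # '1'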

Characteristics of NoSQL Databases:

1. Schemaless Design:
- NoSQL databases typically have a flexible schema or schema-less design, allowing for dynamic and evolving data models.
- This flexibility simplifies application development and accommodates changes in data structure over time.

2. Scalability:
- NoSQL databases are designed for horizontal scalability, meaning they can efficiently handle growing datasets by adding more nodes to the cluster.
- They use distributed architectures to distribute data across multiple nodes, providing high availability and fault tolerance.

3. High Performance:
- NoSQL databases are optimized for specific use cases and data access patterns, providing high performance for read and write operations.
- They often use techniques such as in-memory caching, data partitioning, and optimized storage formats to achieve low latency and high throughput.

4. Eventually Consistent:
- Many NoSQL databases offer eventual consistency guarantees, meaning that data changes are propagated asynchronously and may take some time to propagate to all nodes.
- This relaxed consistency model allows for higher availability and partition tolerance but may result in temporary inconsistencies in data.

Advantages of NoSQL Databases:

1. Scalability:
- NoSQL databases are highly scalable and can handle large volumes of data and high concurrent loads by distributing data across multiple nodes.
- They can scale horizontally by adding more nodes to the cluster, providing linear scalability without significant performance degradation.

2. Flexibility:
- NoSQL databases offer flexible data models and schema-less designs, allowing developers to store and query diverse types of data without rigid schema definitions.
- This flexibility simplifies application development and accommodates changes in data requirements over time.

3. Performance:
- NoSQL databases are optimized for specific use cases and data access patterns, providing high performance for read and write operations.
- They can efficiently handle complex queries, large-scale analytics, and real-time data processing tasks with low latency and high throughput.

4. High Availability:
- NoSQL databases are designed for distributed architectures, providing high availability and fault tolerance by replicating data across multiple nodes.
- They offer built-in mechanisms for data replication, failover, and automatic recovery, ensuring continuous availability even in the event of node failures or network partitions.

5. Cost-Effectiveness:
- NoSQL databases can be more cost-effective than traditional relational databases for large-scale deployments, as they can run on commodity hardware and scale out horizontally.
- They offer a lower total cost of ownership (TCO) by eliminating the need for expensive hardware upgrades and software licenses associated with proprietary databases.

Use Cases of NoSQL Databases:

1. Web Applications:
- NoSQL databases are commonly used in web applications for storing user profiles, session data, and content metadata.
- They provide high scalability and performance for handling large volumes of user-generated content, social interactions, and real-time updates.

2. Big Data Analytics:
- NoSQL databases are well-suited for big data analytics applications, where they can efficiently store and process large volumes of structured, semi-structured, and unstructured data.
- They enable real-time analytics, data exploration, and machine learning tasks by providing high performance and scalability for data processing.

3. Content Management Systems (CMS):
- NoSQL databases are used in content management systems to store and manage multimedia content, documents, and metadata.
- They offer flexible data models and scalability for handling diverse types of content and supporting collaborative workflows in CMS platforms.

4. Internet of Things (IoT):
- NoSQL databases are employed in IoT applications to store and analyze sensor data, telemetry data, and device metadata.
- They can handle large volumes of time-series data and provide real-time analytics capabilities for monitoring, alerting, and predictive maintenance in IoT systems.

Conclusion:

NoSQL databases offer a compelling alternative to traditional relational databases for handling large-scale, distributed, and diverse datasets. With their flexible data models, high scalability, and high performance, NoSQL databases are well-suited for a wide range of use cases in web applications, big data analytics, content management systems, IoT, and more. By understanding the characteristics, advantages, and use cases of NoSQL databases, organizations can leverage these technologies to address the challenges of managing and analyzing data in today's digital age.
