Chapter 4 Data Integration
The process of consolidating data from multiple applications and creating a unified view of
data assets is known as data integration. As companies store information in different databases,
data integration becomes an important strategy to adopt, because it helps business users
combine data from different sources. For example, an e-commerce company may want to
extract customer information from multiple data streams or databases, such as marketing, sales,
and finance. In this case, data integration helps consolidate the data arriving from the
various departmental databases and use it for reporting and analysis.
Data integration is a core component of several different mission-critical data management
projects, such as building an enterprise data warehouse, migrating data from one or multiple
databases to another, and synchronizing data between applications. As a result, there are a
variety of data integration applications, technologies, and techniques used by businesses
to integrate data from disparate sources and create a single version of the truth. Now that you
understand what the data integration process is, let’s dive into the different data integration techniques
and technologies.
At its core, data integration is the combination of technical and business processes used to
combine data from different sources into meaningful and valuable information. Integration
services can extract and transform data from a wide variety of sources, such as XML data files,
flat files, and relational data sources, and then load the data into one or more destinations.
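The extract-and-combine step described above can be sketched in a few lines of Python. The departmental sources, column names, and values here are illustrative assumptions, not a real schema; a CSV flat file and an in-memory query result stand in for two source systems:

```python
import csv
import io

# Simulated flat-file export from a sales system (hypothetical columns).
sales_csv = io.StringIO(
    "customer_id,total_spend\n"
    "1,250.00\n"
    "2,99.50\n"
)

# Simulated rows returned by a marketing database query (hypothetical schema).
marketing_rows = [
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": "grace@example.com"},
]

def integrate(sales_file, marketing):
    """Join the two departmental sources on customer_id into one unified view."""
    sales = {int(r["customer_id"]): float(r["total_spend"])
             for r in csv.DictReader(sales_file)}
    return [
        {"customer_id": m["customer_id"],
         "email": m["email"],
         "total_spend": sales.get(m["customer_id"])}
        for m in marketing
    ]

unified = integrate(sales_csv, marketing_rows)
print(unified[0])  # one consolidated record per customer
```

Real pipelines would read from live databases and APIs rather than in-memory stand-ins, but the shape of the work, parse each source and join on a shared key, is the same.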
2. Data Collection Latency and Delays: In today’s world, data needs to be processed in
real time if you want accurate and meaningful insights. But if developers complete the
data integration steps manually, this is simply not possible; it leads to delays in
data collection. By the time developers have collected last week’s data, this week’s
data is already waiting, and so on. Automated data integration tools solve this
problem effectively: they collect data in real time without the enterprise wasting
valuable resources in the process.
3. Wrong and Multiple Formats: Another common challenge of system integration is that
data arrives in multiple formats. The data saved by the finance department will be in a
format different from how the sales team presents its data. Comparing and combining
unstructured data across formats is neither effective nor useful. An easy solution is to
use data transformation tools, which analyze the formats of incoming data and convert
them to a unified format before adding the data to the central database. Some data
integration and business analytics tools already have this as a built-in feature, which
reduces the number of errors you need to check and fix manually when collecting data.
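The format-unification step above can be sketched as a small transformation function. The two departmental formats (finance storing DD/MM/YYYY dates and amounts in cents, sales storing ISO dates and float amounts) are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical records: finance and sales store the same facts differently.
finance_record = {"date": "31/01/2024", "amount_cents": 12550}
sales_record = {"date": "2024-01-31", "amount": 99.99}

def to_unified(record):
    """Normalize either departmental format to one canonical shape."""
    if "amount_cents" in record:          # finance format
        return {
            "date": datetime.strptime(record["date"], "%d/%m/%Y").date().isoformat(),
            "amount": record["amount_cents"] / 100,
        }
    return {                              # sales format is already canonical
        "date": record["date"],
        "amount": record["amount"],
    }

print(to_unified(finance_record))  # {'date': '2024-01-31', 'amount': 125.5}
```

Running every incoming record through one such normalizer before it reaches the central database is exactly what the built-in transformation features of integration tools automate.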
4. Lack of Quality Data: We have an abundance of data, but how much of it is even worth
processing? Is all of it useful to the business? What if you process incorrect data and
make decisions based on it? These are challenges every organization faces when it
starts data integration, and using low-quality data can result in long-term losses for
an enterprise. How can this issue be solved? Data quality management lets you validate
data before it is added to the warehouse, which saves you from moving unwanted data
into the data warehouse. Your database will then house only high-quality data that has
been validated as genuine.
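A data-quality gate of the kind described above can be as simple as a predicate applied before load. The field names and validation rules here are assumptions for illustration, not a fixed schema:

```python
# Only rows passing every check reach the warehouse; the rest are quarantined.

def is_valid(row):
    checks = [
        row.get("customer_id") is not None,
        isinstance(row.get("email"), str) and "@" in row.get("email", ""),
        isinstance(row.get("age"), int) and 0 < row["age"] < 120,
    ]
    return all(checks)

incoming = [
    {"customer_id": 1, "email": "ada@example.com", "age": 36},
    {"customer_id": None, "email": "broken", "age": -5},   # fails every check
]

clean = [r for r in incoming if is_valid(r)]
rejected = [r for r in incoming if not is_valid(r)]
print(len(clean), len(rejected))  # 1 1
```

In practice the rejected rows would be logged or routed to an error table for review rather than silently dropped.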
5. Numerous Duplicates in the Data Pipeline: Having duplicates in the data warehouse
leads to long-term problems that will affect your business decisions. Hiring data
integration consulting services can help you eliminate data silos by creating a
comprehensive communication channel between departments. When employees share data
across departments, the need to create and save duplicate data naturally falls.
Standardizing validated data also ensures that employees know which data to consider.
Investing in technology is vital, but ensuring transparency across the entire system
is equally important.
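Mechanically, removing duplicates before load comes down to choosing a business key and keeping one record per key. The key choice here (lower-cased email) is an assumption; real pipelines often need fuzzier matching rules:

```python
# Deduplicate pipeline rows on a business key, keeping the first occurrence.

def deduplicate(rows, key=lambda r: r["email"].lower()):
    seen = set()
    unique = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

rows = [
    {"email": "Ada@Example.com", "dept": "sales"},
    {"email": "ada@example.com", "dept": "marketing"},  # duplicate of above
    {"email": "grace@example.com", "dept": "finance"},
]
print(len(deduplicate(rows)))  # 2
```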
6. Lack of Understanding of Available Data: What use is data if employees don’t
understand it or know what to do with it? Not every employee has the same skills, which
makes it hard for some of them to understand the data. For example, the IT department
is comfortable discussing data in technical terms; the same cannot be said for
employees from the finance or HR departments, who use terms from their own fields of
expertise. Consulting companies that offer data integration services help create a
common vocabulary that can be used throughout the enterprise. It’s like a glossary
shared with every employee to help them understand what a certain term or phrase means.
This reduces miscommunication and mistakes caused by misunderstanding the existing data.
7. Existing System Customizations: It’s likely that your existing systems have already
been customized to suit specific business needs. Bringing in more tools and software
can complicate things if they are not compatible with each other. One of the data
integration features you should invest in is support for multiple deployment options.
Whether on-premises or in the cloud, and whether linking to an existing system or
building a new one to suit a data-driven model, data integration services can include
ways to combine different systems and bring them together on the same platform.
9. The Number of Systems and Tools Used in the Enterprise: Most enterprises use
multiple platforms based on the type of software their employees need. The same goes
for the systems and tools used in different departments; the marketing team relies on
software the HR team never touches. With so many systems to deal with, gathering data
can be a complex task that needs cooperation from every employee. An easy way to
collect data from multiple systems and tools is to use pre-configured integration
software that works with almost any business setup, so you don’t need to invest in
separate tools to extract data from each source.
10. No Data Security: Any list of data integration challenges must include data
security, or the lack of it. How many businesses have been attacked by cybercriminals
in recent times? Neither industry giants nor small startups have been spared. Data
leaks, data breaches, and data corruption can leave the enterprise vulnerable to
further cyberattacks, and it can be weeks or months before you recognize it. Data
integration services that offer end-to-end solutions address data security problems by
enhancing the security systems in the business, ensuring that only authorized
employees can access the data warehouse to add, delete, or edit the stored information.
11. Extracting Valuable Insights from Data: A complaint several businesses make is that
they are unable to extract valuable insights after data integration. How can data
integration issues be avoided when there is no proper planning? There is one effective
step that enterprises often skip before investing in data integration: planning, as
mentioned in the previous points. SMEs need to know what they want to achieve before
investing in any system; unless the long-term goals are clear, you cannot choose the
right strategy to reach them. You will need to select analytical tools that integrate
with the data warehouse. This ensures a continuous cycle in which data is collected,
processed, and analyzed, and reports are generated to help you improve your business.
Characteristics
• Supports a variety of data sources
• SQL-based API
• Real-time programming model
• Location transparency
• Automatic data type conversion services
• Ability to join, union, aggregate, and otherwise correlate data from multiple
sources in a single query
• Ability to create individual views based on data integrated from multiple sources
4.4 Critical ETL components:
Regardless of the exact ETL process you choose, there are some critical components you’ll
want to consider:
• Support for change data capture (CDC): Incremental loading allows you to update
your analytics warehouse with new data without doing a full reload of the entire data
set.
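Incremental loading with CDC typically relies on a high-water mark: only rows modified since the last successful load are extracted. The table shape and column names below are hypothetical:

```python
from datetime import date

# Simulated source table with a last-modified column (hypothetical schema).
source_rows = [
    {"id": 1, "updated_at": date(2024, 1, 1)},
    {"id": 2, "updated_at": date(2024, 1, 15)},
    {"id": 3, "updated_at": date(2024, 2, 1)},
]

def incremental_extract(rows, last_loaded):
    """Return only rows changed after the previous load's watermark."""
    return [r for r in rows if r["updated_at"] > last_loaded]

delta = incremental_extract(source_rows, last_loaded=date(2024, 1, 10))
print([r["id"] for r in delta])  # [2, 3]
```

After a successful load, the watermark is advanced to the latest `updated_at` seen, so the next run pulls only what changed since then.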
• Auditing and logging: You need detailed logging within the ETL pipeline to ensure
that data can be audited after it’s loaded and that errors can be debugged.
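One common way to get that audit trail is to wrap every pipeline stage in structured logging of row counts in and out. The stage names here are illustrative:

```python
import logging

# Log row counts around each stage so loads can be audited and failures debugged.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_stage(name, func, rows):
    log.info("stage=%s rows_in=%d", name, len(rows))
    try:
        out = func(rows)
    except Exception:
        log.exception("stage=%s failed", name)   # full traceback for debugging
        raise
    log.info("stage=%s rows_out=%d", name, len(out))
    return out

rows = [{"id": 1}, {"id": 2}, {"id": None}]
cleaned = run_stage("drop_null_ids", lambda rs: [r for r in rs if r["id"]], rows)
print(len(cleaned))  # 2
```

A drop from `rows_in=3` to `rows_out=2` is now visible in the log, which is exactly the kind of evidence an after-the-fact audit needs.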
• Handling of multiple source formats: To pull in data from diverse sources such as
Salesforce’s API, your back-end financials application, and databases such as MySQL
and MongoDB, your process needs to be able to handle a variety of data formats.
• Fault tolerance: In any system, problems inevitably occur. ETL systems need to be
able to recover gracefully, making sure that data can make it from one end of the
pipeline to the other even when the first run encounters problems.
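A standard pattern for this kind of graceful recovery is retrying the failing step with exponential backoff. The flaky source below is simulated; the retry limits are arbitrary choices:

```python
import time

# Simulate an extract that fails twice before succeeding.
attempts = {"count": 0}

def flaky_extract():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source unavailable")
    return [{"id": 1}]

def with_retries(func, max_tries=5, base_delay=0.01):
    """Retry func on connection errors, doubling the wait between attempts."""
    for attempt in range(max_tries):
        try:
            return func()
        except ConnectionError:
            if attempt == max_tries - 1:
                raise                       # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))

rows = with_retries(flaky_extract)
print(attempts["count"], rows)  # 3 [{'id': 1}]
```

The data still makes it through even though the first two runs hit problems, which is the fault-tolerance property described above.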
• Notification support: If you want your organization to trust its analyses, you have to
build in notification systems to alert you when data isn’t accurate. These might include:
• Proactive notification directly to end users when API credentials expire
• Passing along an error from a third-party API with a description that can help
developers debug and fix an issue
• If there’s an unexpected error in a connector, automatically creating a ticket to
have an engineer look into it
• Utilizing systems-level monitoring for things like errors in networking or
databases
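A minimal version of such a notification hook is a check that alerts instead of silently writing bad data. Here `notify` just collects messages; in a real system it would send an email, page an engineer, or open a ticket (all assumptions about your setup):

```python
alerts = []

def notify(message):
    alerts.append(message)   # stand-in for an email/pager/ticketing call

def check_row_counts(extracted, loaded):
    """Alert when the load lost rows relative to the extract."""
    if extracted != loaded:
        notify(f"row count mismatch: extracted={extracted} loaded={loaded}")
        return False
    return True

check_row_counts(extracted=100, loaded=97)
print(alerts)
```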
• Low latency: Some decisions need to be made in real time, so data freshness is critical.
While there will be latency constraints imposed by particular source data integrations,
data should flow through your ETL process with as little latency as possible.
• Scalability: As your company grows, so will your data volume. All components of an
ETL process should scale to support arbitrarily large throughput.
• Accuracy: Data cannot be dropped or changed in a way that corrupts its meaning.
Every data point should be auditable at every stage in your process.
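One way to make every data point auditable is to fingerprint each record with a stable hash and compare fingerprints between stages. This is a sketch of the idea, not a prescribed mechanism:

```python
import hashlib
import json

def fingerprint(row):
    """Deterministic hash of a record (key order normalized by sort_keys)."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

extracted = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
loaded = [{"amount": 10.0, "id": 1}, {"id": 2, "amount": 20.0}]  # same data

missing = {fingerprint(r) for r in extracted} - {fingerprint(r) for r in loaded}
print(len(missing))  # 0 -> every extracted record arrived unchanged
```

Any record dropped or altered between extract and load would leave its hash behind in `missing`, giving a concrete audit signal.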
4.5 Data Warehousing:
Data warehousing is a process used to collect and manage data from multiple sources
into a centralized repository to drive actionable business insights. With all your data in
one place, it becomes simpler to perform analysis and reporting at different aggregate
levels. It is the core of the BI system and helps you make better business decisions. In
simple words, it is the electronic storage space for all your business data integrated from
different marketing and other sources.
Types of Data Warehousing:
❑ Enterprise Data Warehouse: Enterprise data warehouse is a centralized warehouse
that offers decision-making support to different departments across an enterprise. It
provides a unified approach for organizing as well as classifying the data by subject.
❑ Operational Data Store: Popularly known as ODS, an Operational Data Store is used
for an organization’s routine reporting. It can be refreshed in real time, making it
best for routine activities like storing employee records.
❑ Data Mart: Data Mart is particularly designed for a specific business line like finance,
accounts, sales, purchases, or inventory. The warehouse allows you to collect data
directly from the sources.
Data Warehouse Appliances
❑ Data Warehouse Appliances are a set of hardware and software tools used for storing
data.
❑ Every data-driven business uses these appliances to build a centralized and
comprehensive data warehouse, where all kinds of functional business data can be
stored.
Figure: DW Appliances
❑ Data warehousing is the process of combining data from multiple sources and organizing
it in a way that supports organizational tactical and strategic decision making.
❑ The main purpose of a data warehouse is to provide a transparent picture of the business
at a given point in time.
Business Intelligence (BI) : can be described as a set of tools and methods that facilitate
the transformation of raw data into meaningful patterns to drive useful insights to make
better business decisions.
❑ The process of BI involves data preparation, analytics, and visualization.
❑ BI(Business Intelligence) is a set of processes, architectures, and technologies that
convert raw data into meaningful information that drives profitable business actions.
❑ It is a suite of software and services to transform data into actionable intelligence and
knowledge.
❑ Business Intelligence is an umbrella term used alongside data analytics. It is a
process that covers data preparation, analytics, and visualization.
❑ Data warehousing, by contrast, describes the tools that combine data from disparate
sources, clean the data, and prepare it for analysis.
❑ BI has a direct impact on an organization's strategic, tactical, and operational
business decisions. BI supports fact-based decision making using historical data rather
than assumptions.
❑ BI tools perform data analysis and create reports, summaries, dashboards, maps,
graphs, and charts to provide users with detailed intelligence about the nature of the
business.
Figure: Business Intelligence
Business intelligence software and systems:
❑ A variety of different types of tools fall under the business intelligence umbrella. The
software selection service breaks down some of the most important categories and
features:
❖ Dashboards
❖ Visualizations
❖ Reporting
❖ Data mining
❖ ETL (extract, transform, load: tools that import data from one data store into
another)
❖ OLAP (online analytical processing)
1) Manual Data Integration Approach: Manual data integration describes the process of a
person manually collecting the necessary data from different sources by accessing them
directly. The data is cleaned as needed, and stored in a single warehouse. This method of
data integration is extremely inefficient and makes sense only for small organizations with
an absolute minimum of data resources. There is no unified view of the data.
It occurs when a data manager oversees all aspects of the integration, usually by writing
custom code: connecting the different data sources, collecting the data, cleaning it,
and so on, without automation.
In this approach, a web-based user interface or an application is created for users of the
system to show all the relevant information by accessing all the source systems directly.
There is no unification of data in reality.
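Manual integration can be shown in miniature: hand-written code opens each source directly and stitches the results together, with no automation and no unified store. Two in-memory SQLite databases stand in for the departmental systems here; the table names and data are illustrative:

```python
import sqlite3

# Hypothetical CRM system.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")

# Hypothetical billing system, accessed separately.
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, total REAL)")
billing.execute("INSERT INTO invoices VALUES (1, 250.0), (2, 99.5)")

# The "integration" is just custom glue code run by hand.
totals = dict(billing.execute("SELECT customer_id, total FROM invoices"))
report = [(name, totals.get(cid))
          for cid, name in crm.execute("SELECT id, name FROM customers")]
print(report)  # [('Ada', 250.0), ('Grace', 99.5)]
```

Every new report means more glue code touching the sources directly, which is why this approach only makes sense at very small scale.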
2) Cloud Integration:
❑ Companies that use cloud integration have synchronized data and applications,
improving their ability to operate effectively and nimbly.
❑ Other benefits include:
❖ Improved operational efficiency
❖ Increased flexibility and scalability
❖ Faster time-to-market
❖ Better internal communication
❖ Improved customer service, support, and retention
❖ Increased competitive edge
❖ Reduced operational costs and increased revenue
4.7 SSIS Is a Software Development Platform
❑ The SSIS tool helps you merge data from various data stores
❑ Automates Administrative Functions and Data Loading
❑ Populates Data Marts & Data Warehouses
❑ Helps to clean and standardize data and Building BI into a Data Transformation
Process.
❑ Automating Administrative Functions and Data Loading
❑ SSIS provides a GUI that helps users transform data easily rather than writing large
programs, and it can load millions of rows from one data source to another in a few
minutes
❑ Identifying, capturing, and processing data changes
❑ Coordinating data maintenance, processing, or analysis
❑ SSIS reduces the need for hand-coded programming
❑ SSIS offers robust error and event handling.
❑ Studio Environments
❑ Relevant data integration functions
❑ Effective implementation speed
❑ Tight integration with the rest of the Microsoft SQL Server family
❑ Data Mining Query Transformation
❑ Fuzzy Lookup and Grouping Transformations
❑ Term Extraction and Term Lookup Transformations
❑ High-speed data connectivity components, such as connectivity to SAP or Oracle
1. Control Flow
Control flow is the brain of an SSIS package. It arranges the order of execution for
all of the package's components, which include containers and tasks managed by
precedence constraints.
2. Precedence Constraints
Precedence constraints are package components that direct tasks to execute in a
predefined order and define the workflow of the entire SSIS package. A precedence
constraint controls the execution of two linked tasks by executing the destination task
based on the result of the earlier task.
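The "on success" / "on failure" routing that precedence constraints provide can be mimicked in plain Python as an analogy (this sketch is not SSIS itself, just the control-flow idea):

```python
def run_with_precedence(task, on_success=None, on_failure=None):
    """Run a task; route to the linked task chosen by the outcome."""
    try:
        task()
    except Exception:
        if on_failure:
            on_failure()        # e.g. send an alert, write to an error table
    else:
        if on_success:
            on_success()        # the linked destination task runs only now

log = []
run_with_precedence(
    task=lambda: log.append("extract ok"),
    on_success=lambda: log.append("load ran"),
    on_failure=lambda: log.append("alert sent"),
)
print(log)  # ['extract ok', 'load ran']
```

In SSIS the same routing is drawn visually as green (success) and red (failure) arrows between tasks rather than written as code.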
3. Data Flow
The main use of the SSIS tool is to extract data into the server's memory, transform it, and
write it to another destination. If Control Flow is the brain, Data Flow is the heart of SSIS.
4.Containers
A container is a unit for grouping tasks together into units of work. Apart from
offering visual consistency, it also allows you to declare variables and event handlers
that are scoped to that specific container.
Two types of containers in SSIS are: 1) Sequence Container and 2) Loop Container
1) Sequence Container: allows you to organize subsidiary tasks by grouping them.
2) For Loop Container: provides the same functionality as the Sequence Container,
except that it runs the tasks multiple times.
SSIS Packages:
Another core component of SSIS is the notion of a package. It is a collection of tasks
that execute in an orderly fashion, with precedence constraints managing the order in
which the tasks execute. A package can be saved to a file or to SQL Server, in the
package catalog database.
Disadvantages of SSIS