The Complete Data Migration Methodology
Joseph R. Hudicka, Dulcian, Inc.

Overview

Most software implementation efforts are conducted to satisfy one of the following initiatives:

• Deploying a new Online Transaction Processing (OLTP) system
• Deploying an Online Analytical Processing (OLAP) system

Each type of system may be replacing and/or enhancing functionality currently delivered by one or more
legacy systems. This sort of systems evolution means that organizations are working to grow at or near the
pace that the ever-changing world of technology dictates. Choosing a new technological direction is
probably the easiest task in the entire effort.

Complications arise when we attempt to bring together the information currently maintained by the legacy
system(s) and transform it to fit into the new system. We refer to the building of this bridge between the old
and new systems as data migration. Data migration is a common component of most systems development
efforts.

One would think that any two systems that maintain the same sort of data must have been performing very
similar tasks. Therefore, information from one system should map to the other with ease. This is hardly
ever the reality of the situation. Legacy systems have historically proven to be far too lenient with respect
to enforcing integrity at the atomic level of data. Fields that should be populated from a list of valid values,
such as STATES, tend to require that a value be entered, but seldom validate the value entered by the user.
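
For example, a short query of the following kind (table and column names are illustrative, and it assumes
the list of valid codes has been loaded into a reference table) usually exposes the problem quickly:

    -- Values typed into a "validated" field that are not in the valid list.
    -- In a clean system this query would return no rows.
    SELECT state, COUNT(*) AS occurrences
      FROM customer
     WHERE state NOT IN (SELECT state_code FROM valid_states)
     GROUP BY state
     ORDER BY COUNT(*) DESC;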

Another common problem has to do with the theoretical design differences between hierarchical legacy
systems and relational systems. Two of the cornerstones of hierarchical systems, namely de-normalization
and redundant storage, are strategies that make the relational purist cringe.

The most significant problem with data migration projects is that people really do not understand the
complexity of data transformation until they have undergone a number of arduous migration projects.

Having made these points, it is obvious to this author that there is a desperate need for a sound
methodological approach with which organizations can tackle migration projects. Although there is no way
to avoid unpleasant surprises in data migrations, we can certainly be prepared to confront and resolve them.

The Phases of Data Migration

The ideal data migration project plan should be broken down into phases, which mirror the overall project
development phases. These phases consist of:

• Pre-Strategy
• Strategy
• Pre-Analysis
• Analysis
• Pre-Design
• Design
• Pre-Test
• Test
• Implementation
• Revise
• Maintenance

Each of these phases will be defined in further detail in the following sections. Some may choose to argue
about the necessity of dividing the migration process into the finite levels of the overall project phases. As
you proceed through this paper, it will become apparent that each of these sub-divisions requires critical
milestones to be achieved. These milestones mark strategic points along the project’s timeline. Any
successes or failures in a given phase will significantly impact the outcome of the entire project.

Pre-Strategy

The Pre-Strategy phase is usually the clearest part of the migration project planning process. In this phase,
the focus of the overall project is determined. It is important to note that data migration projects do not
happen independently. Rather, they are spawned from other development efforts such as the
implementation of new OLTP and/or OLAP systems. This is where the first fundamental mistake generally
occurs. The project manager is clearly focused on determining the requirements that the new system must
satisfy, and pays little or no attention to the data migration(s) that must occur. It is quite common to review
a project plan for a new system and discover that data migration is merely listed as a single task item, if at
all. Those working with such project plans will discover, in hindsight, that a separate project plan
devoted entirely to the data migration effort should have been included.

The focus of the Pre-Strategy phase is to determine the scope of the migration. In other words, answering
the question, ‘What are we trying to migrate?’ This is the time to identify the number of legacy systems
requiring migration and a count of their data structures. Interfaces are another critical factor that should be
identified at this juncture. Interfaces are no different from other data sources, except that they may receive
data from the new system as well as supply data to it.

At this point, we are not only identifying the number of data files, but also the number of different systems
from which data will be migrated. In the case of OLAP systems, multiple data sources are a common
occurrence. Data sources are not limited to actual data processing systems. Inevitably, one will find
employees that maintain files on their own workstations that they use to accomplish tasks that cannot be
supported by their existing systems. Word processing documents, spreadsheets, desktop RDBMS packages,
and raw text files are just a few examples of data sources you can expect to uncover in the early phases of
data migration.

Strategy

The Strategy phase of data migration should be scheduled to occur concurrently with the Strategy phase of
the core project (i.e. OLTP, OLAP). Unfortunately, in most cases, the same people are expected to perform
both analyses. This can be done, but these people need a clearly defined set of tasks for which they are
responsible.

The focus of the Strategy phase is to determine whether or not the objectives documented in the Pre-
Strategy phase are achievable. This involves examining the actual data you plan to migrate. Remember, at
this point in the project, you have no idea if the data is even of high enough quality to consider migrating.
In order to get a better sense, it is helpful to obtain reports that can provide row counts, column counts and
other statistics pertaining to your source data. This kind of information gives you a rough idea of just how
much data there is to migrate. You may find that the overall cost of migration is prohibitive relative to the
quantity of data that needs to be moved.
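
A minimal sketch of such a report, run against legacy data already copied into Oracle (the staging table
and column names here are illustrative only), might look like this:

    -- Rough volume and date-range statistics for one legacy file
    SELECT COUNT(*)        AS row_count,
           MIN(order_date) AS earliest_record,
           MAX(order_date) AS latest_record
      FROM stg_orders;

    -- Column counts per staged table, taken from the Oracle data dictionary
    SELECT table_name, COUNT(*) AS column_count
      FROM user_tab_columns
     WHERE table_name LIKE 'STG%'
     GROUP BY table_name
     ORDER BY table_name;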

If this occurs, the most common solution is to migrate the source data to the new platform into data
structures that are constructed identically to those of the source system. Doing so allows you to shut down
the old system and bring up the new one with confidence and without losing historical data. The cost of
developing a set of reports to access the historical data on the new platform tends to be far cheaper than the
cost of migration in nearly every scenario.
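
As a hedged sketch, assuming the legacy tables are reachable through an Oracle database link (the link name
legacy_link and the table names are hypothetical), the identical structures can be built and loaded in one
step; a flat-file load through SQL*Loader achieves the same result when no link is available:

    -- Copy the legacy structure and its historical data as-is, with no transformation.
    CREATE TABLE orders_history AS
      SELECT * FROM orders@legacy_link;

    -- Inexpensive reports against the copy then replace legacy reporting, e.g.:
    SELECT TO_CHAR(order_date, 'YYYY') AS order_year, COUNT(*) AS order_count
      FROM orders_history
     GROUP BY TO_CHAR(order_date, 'YYYY')
     ORDER BY order_year;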

The milestone of the Strategy phase is the Data Migration Strategy Document, which outlines the intentions
of the overall migration effort. In this document, we outline the reasons for our conclusions about whether
data migration is worthwhile. The data quality research performed in this phase is still at a very high level,
and in no way suggests that the team has gained a thorough understanding of the specific data cleansing
issues they will face later in the project.

Pre-Analysis

The Strategy phases were geared towards the development of a precise definition of the migration project’s
goals. The Pre-Analysis phase expands on the information gathered in Strategy so that we can determine the
procedures that will result in the migration plan.

By the Pre-Analysis phase, we already understand what needs to be done. Now, we must determine who
will perform these tasks. We also must perform more comprehensive data cleansing research, extending the
knowledge of our findings in the Strategy phase. We built a set of canned data quality reports, which can
be useful in the Strategy phase. At this point in the Pre-Analysis phase, real developers must begin rolling
up their sleeves and write and execute SQL statements against legacy data.

On the first day of the Pre-Analysis phase, someone should be appointed to port the legacy data from its
current environment into an Oracle staging area. A staging area is merely a database account that contains
tables that essentially replicate the data structures of the legacy system. One benefit of porting the
uncleansed legacy data to the Oracle environment is the ability to have your legacy expert staff begin
working with Oracle very early on in the project development stage, thus spreading the learning curve out
as much as possible.
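
A staging table is typically a loosely typed copy of the legacy record layout. A hypothetical example for a
legacy customer file (names and widths are illustrative) might be:

    -- Staging table mirroring a legacy flat-file layout.
    -- Everything is kept as free-form text so the load never rejects rows;
    -- typing and cleansing happen later, in SQL.
    CREATE TABLE stg_customer (
        cust_no     VARCHAR2(20),
        cust_name   VARCHAR2(80),
        state       VARCHAR2(10),
        birth_date  VARCHAR2(20)   -- dates stay as text until validated
    );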

After you have decided upon the legacy data sources and have conducted thorough data research, you must
begin the candidate data element selection process. This involves going through the list of data elements
from each and every source data structure, and deciding whether or not each one must be migrated. You
may begin this process while in the Analysis phase of the project.
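
Population statistics are a useful input to this selection. A sketch against the hypothetical staging table
above shows how often each candidate element is actually populated:

    -- How many rows carry a value in each candidate field?
    SELECT COUNT(*)                                 AS total_rows,
           COUNT(state)                             AS state_populated,
           COUNT(birth_date)                        AS birth_date_populated,
           ROUND(100 * COUNT(state) / COUNT(*), 1)  AS pct_state_populated
      FROM stg_customer;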

Analysis

The Analysis phase of data migration also happens in parallel with the Analysis phase of the core project.
This is because each data element identified as a candidate for migration inevitably results in a change to
the data model.

The Analysis phase is not intended to thoroughly identify the transformation rules by which historical data
will be massaged into the new system; rather, it involves making a checklist of the legacy data elements
that we know must be migrated. This list of data elements will have three sources:

1. Legacy data analysis, as described above in the Pre-Analysis phase.
2. Legacy report audits: A complete set of legacy reports from the old OLTP system is reviewed. A
comprehensive list of field usages should be compiled from these reports.
3. User feedback sessions: These sessions should be scheduled for topics such as data model audits,
reviews of interview notes and the results of the legacy data analysis and legacy report audits.

Pre-Design

In the later stages of the Pre-Design phase, attribution of the data model will have been completed, and we
will be ready to generate the physical database design. The attribution process will be fed from legacy data
analysis, legacy report audits, and user feedback sessions. This is essentially the start of data mapping.
Mapping must occur against physical data structures because entities do not contain the foreign key
attributes that their corresponding tables will contain. The mapping of foreign key columns is extremely
important, and it is easily overlooked when attempting to map to logical rather than physical structures.
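
A simple illustration, with hypothetical names: the logical ORDER entity carries no customer attribute
because the relationship to CUSTOMER supplies it, yet the table generated from that entity must carry the
foreign key column, and that column needs a source mapping of its own:

    -- Physical table generated from the ORDER entity.
    -- CUST_NO exists only at this level: it is the relationship to CUSTOMER
    -- made concrete, and it must be mapped from the legacy data like any
    -- other column.
    CREATE TABLE orders (
        order_no    NUMBER PRIMARY KEY,
        order_date  DATE,
        cust_no     NUMBER NOT NULL REFERENCES customers (cust_no)
    );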

Design

The Design phase is where the bulk of the actual mapping of legacy data elements to columns takes place.
The physical data structures have been frozen, offering an ideal starting point for migration testing. Note
that data migration is iterative – it does not happen in a single sitting.

The mapping portion of a data migration project can be expected to span the Design phase through
Implementation. The reason for this is quite simple. The most important resources for validating the
migration are the users of the new system. Unfortunately, they will be unable to grasp the
comprehensiveness of the migration until they view the data through the new applications. We have
concluded from experience that developing the new reports prior to new forms allows for more thorough
validation of the migration earlier on in the project lifespan. For instance, if some sort of calculation was
performed incorrectly by a migration script, reports will reflect this. A form typically displays a single
master record at a time, whereas reports display several records per page, making them a better means of
displaying the results of migration testing.

A popular misconception about data mapping is that it can be performed against logical data models.
Unfortunately, logical data models represent attributes that arise through relationships without defining the
attributes in the child entity. This essentially means that you cannot map any of the connections between
data structures while you are working with the logical design. Therefore, it is necessary to perform data
mapping to the physical data model.

With the physical data structures in place, you can begin the mapping process. Mapping is generally
conducted by a team of at least three people per core business area (i.e. Purchasing, Inventory, Accounts
Receivable, etc.). Of these three people, the first should be a business analyst, generally an end user
possessing intimate knowledge of the historical data to be migrated. The second team member is usually a
systems analyst with knowledge of both the source and target systems. The third person is a
programmer/analyst that performs data research and develops migration routines based upon the mappings
defined by the business analyst and the systems analyst, cooperatively.
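
The migration routines that result from this teamwork are, at their simplest, INSERT ... SELECT statements
that apply the agreed transformation rules to the staged legacy data. A hedged sketch, reusing the
hypothetical staging table from the Pre-Analysis phase:

    -- Move cleansed customers from staging into the new structure,
    -- applying the transformation rules agreed upon during mapping.
    INSERT INTO customers (cust_no, cust_name, state, birth_date)
    SELECT TO_NUMBER(cust_no),
           INITCAP(TRIM(cust_name)),
           UPPER(TRIM(state)),
           TO_DATE(birth_date, 'YYYYMMDD')
      FROM stg_customer
     WHERE cust_no IS NOT NULL;
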
Pre-Test/Test/Implementation

Since testing and implementation are practically inseparable, they are often combined into one phase. The
Pre-Test phase deals with two core subject areas: logical errors and physical errors. Physical errors are
typically syntactical in nature and can be easily identified and resolved. Physical errors have nothing to do
with the quality of the mapping effort. Rather, this level of testing deals with the semantics of the scripting
language used in the transformation effort.

The Test phase is where we identify and resolve logical errors. The first step is to execute the mapping.
Even if it completes successfully, we must still ask questions such as the following (simple validation
queries, sketched after the list, can help answer them):

• How many records did we expect this script to create?
• Did the correct number of records get created? If not, why?
• Has the data been loaded into the correct fields?
• Has the data been formatted correctly?
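
Simple comparison queries answer most of these questions. A sketch, again using the hypothetical customer
structures from earlier sections:

    -- Did the expected number of records get created?
    SELECT (SELECT COUNT(*) FROM stg_customer) AS staged_rows,
           (SELECT COUNT(*) FROM customers)    AS migrated_rows
      FROM dual;

    -- Were any rows loaded with missing or implausible values?
    SELECT COUNT(*) FROM customers WHERE state IS NULL;
    SELECT COUNT(*) FROM customers WHERE birth_date > SYSDATE;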

The truest test of data mapping is providing the populated target data structures to the users that assisted in
the analysis and design of the core system. Invariably, the users will begin to identify scores of other
historical data elements to be migrated that were not apparent to them during the Analysis/Design phases.

The fact is that data mapping often does not make sense to most people until they can physically interact
with the new, populated data structures. Frequently, this is where the majority of transformation and
mapping requirements will be discovered. Most people simply do not realize they have missed something
until it is not there anymore. For this reason, it is critical to unleash users on the populated target data
structures as soon as possible.

The data migration Testing phases must be reached as soon as possible, to ensure that they occur prior to the
Design and Build phases of the core project. Otherwise, months of development effort can be lost as each
additional migration requirement slowly but surely wreaks havoc on the data model, which, in turn,
requires substantive modifications to the applications.

The measures taken in the Test phases are executed as early as the Design phase. Testing is just as iterative
as the migration project itself, in that every enhancement must pass the test plan.

Revise

The Revise phase is really a superset of the last four phases (Pre-Test, Test, Implementation and Maintenance),
which are iterative and do not take place independently. This is the point in the process where cleanup is
managed. All of the data model modifications, transformation rule adjustments, and script modifications
are essentially combined to form the Revise phase.

At this point, the question must be asked: Are both the logical and physical data models being maintained?
If so, you now have doubled the administrative workload for the keeper of the data models. In many
projects, the intention is to maintain continuity between the logical and physical designs. However, because
the overwhelming volume of work tends to exceed its perceived value, the logical model is inevitably
abandoned, resulting in inconsistent system documentation.

CASE tools can be used to maintain the link between the logical and physical models, though it is
necessary for several reports to be developed in house. For example, you will want reports that indicate
discrepancies between entities/tables and attributes/columns. These reports will indicate whether there is a
mismatch between the number of entities versus tables and/or attributes versus columns, identify naming
convention violations, and seek out data definition discrepancies. Choose a CASE tool that provides an API
to the metadata, because it will definitely be needed.
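
The details depend entirely on the CASE tool's repository, but if the logical model can be exported into a
simple work table (the LOGICAL_ATTRIBUTES table below is hypothetical), a discrepancy report reduces to a
comparison against the Oracle data dictionary:

    -- Columns present in the physical schema with no counterpart in the
    -- exported logical model (name translation deliberately simplified).
    SELECT c.table_name, c.column_name
      FROM user_tab_columns c
     WHERE NOT EXISTS (
           SELECT 1
             FROM logical_attributes a   -- hypothetical export of the logical model
            WHERE a.table_name  = c.table_name
              AND a.column_name = c.column_name );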

Maintenance

The Maintenance phase is where all of the mappings are validated and successfully implemented in a series
of scripts that have been thoroughly tested. The Maintenance phase differs depending upon whether you are
migrating to an OLTP or an OLAP system.

If the migration is to an OLTP system, you are working within the ‘one and done’ paradigm. Your goal is to
successfully migrate the legacy data into the new system, rendering the migration scripts useless when the
migration has been accomplished.

If the migration is to an OLAP system, you will most likely be reloading the new system at regular
intervals. As new information is recorded in the OLTP system, you will need to transfer it to the OLAP
system. Script performance is a critical issue in OLAP migrations, while OLTP migrations pay little or no
attention to script performance since they will only be run once.
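
A hedged sketch of such a periodic reload, assuming the OLTP rows carry a LAST_UPDATED timestamp and the
warehouse keeps a load-history table (both are assumptions, not givens):

    -- Append only the rows changed since the previous load.
    INSERT INTO dw_sales_fact (order_no, cust_no, order_date, order_total)
    SELECT order_no, cust_no, order_date, order_total
      FROM orders
     WHERE last_updated > (SELECT MAX(load_date) FROM dw_load_history);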

Data Transformation Tool vs. Programmer

How does an organization decide whether to use a data transformation tool to manage its data migration
effort, or simply employ a group of willing and able technical experts to accept the bulk of the workload?

Typically, the decision is made based upon the comparison of the cost of a transformation tool and the cost
of employing a few extra pairs of hands. Can you guess which option a manager who listed data migration
on the project plan as a single task opts for more often?

This is not to say that the purchase of a transformation tool is the only solution. However, I do believe that
many people do not truly understand all of the benefits that a data transformation tool can deliver.

Data Transformation Tool Features

A data transformation tool offers a central repository to support the storage and ongoing maintenance of
data mappings, lending itself to teams of all sizes. The most commonly used alternative tends to be a
spreadsheet, which invariably fails when you need to map one source data element to one or more target
elements. Spreadsheets typically prevent more than one person from making modifications at one time,
resulting in a great deal of unnecessary administrative overhead. Spreadsheets are two-dimensional tools,
and mapping is without question multi-dimensional to the nth degree.

A good data transformation tool should include the following features:

1. Flexible Reporting: Reporting is easily accomplished, and serves as a comprehensive communication
device between the systems and business analysts.
2. Code Generation from Mapping Rules: Systems analysts can generate the migration scripts directly
from the mappings they have defined, reducing the overall programming staff requirements.
3. Script Scheduling Utilities: Some tools provide far more sophisticated options such as script
compilation and scheduling utilities. Scheduling utilities can cut the overall script execution time
of a migration project by more than one third in almost every case, by processing scripts in
parallel as opposed to sequentially.
4. Automatic Integrity Violation Detection: Another important and useful feature of a data
transformation tool is the automatic detection of integrity violations. On one migration project,
this author uncovered 4.5 million data integrity violations. It certainly was not humanly possible to
identify all of these violations manually in months, let alone days. However, I was able to develop
a utility to automatically detect these violations (built around queries like the one sketched after
this list), and had the results within four days.
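
The core of such a utility is often nothing more than an anti-join per relationship, generated or written once
and run against every pair of related structures. A sketch for one hypothetical relationship:

    -- Detail rows whose parent key does not exist: each row returned is a
    -- referential integrity violation to resolve before loading the target.
    SELECT o.order_no, o.cust_no
      FROM stg_orders o
     WHERE NOT EXISTS (
           SELECT 1
             FROM stg_customer c
            WHERE c.cust_no = o.cust_no );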

Manual Data Migration

The manual approach to data migration is still a valid method. In fact, there are many small data migration
projects that have very little data to migrate, and as a result do not require a major investment in a data
migration tool or large amounts of effort by one or more programmers.

Decision Factors in Selecting a Data Transformation Tool

Inevitably, a decision about whether to perform data migration manually or purchase a data transformation
tool must be made. Though the cost of a transformation tool is indeed recognized up front, most project
leaders simply cannot foresee the complexity of the data migration aspects of a systems development
project. This is an observation that I have found to be true at several organizations around the country.

A typical data migration project, assuming one legacy data source and a target system of 400 tables,
requires at minimum a three-person team, as described earlier in the paper. With a team of this size, the data
migration will likely take about 6 months. However, with the use of a transformation tool, the overall
project timeline can be cut in half.

Conclusion

Data Migration is a necessary evil, but not an impossible one to conquer. The key is to prepare for it very
early on, and monitor it carefully throughout the process. Project timelines tend to become more rigid as
time passes, so it really makes sense to meet migration head on. A devoted team with a clearly defined
project plan from project inception, armed with automated tools where applicable, is indeed the formula for
success.

About the Author

Joseph R. Hudicka is the CEO of Dulcian, Inc. (www.dulcian.com). He is a member of the Oracle8
Customer Advisory Council providing insight regarding future enhancements to the Oracle8 Server
technologies. He is a systems design specialist for Decision Support (OLAP) and Online Transactional
Processing (OLTP) systems with specific business knowledge in finance, health care, pharmaceutical, and
manufacturing areas. Mr. Hudicka is the chief architect of DataMIG™ (Patent Pending), a complete data
migration management tool suite. He delivers presentations regularly at major Oracle conferences and local
Oracle user groups. Joe can be contacted at jhudicka@dulcian.com.

© 1999 Dulcian, Inc.
