
Data migration & data remediation for corporate acquisition in the Life Science industry

About BlueSoft
BlueSoft specializes in bespoke, modern IT solutions, focused on building business value together with our customers. One of BlueSoft's core areas is data integration and migration. BlueSoft has 20 years of experience with projects delivered for international clients of different scales. Hundreds of qualified specialists in this domain and a wide range of modern ETL technologies make BlueSoft a solid choice of vendor for any data project.

Introduction
BlueSoft engineers took up the challenge of a global data migration for two international companies well recognized in the Life Science sector.

The project was complex because these two companies processed and stored their data differently, using different tools and approaches to customers, sales
models, services, and regionalization.

Those challenges called for a modern and sophisticated set of reliable tools capable of providing secure and efficient data processing, delivering a wide range of external system connectors, and allowing secure and reliable data manipulation and cleansing with a high level of scalability.

This is why the Talend suite, supported by AWS RDS services, was chosen as the core ETL toolset for this challenge.

Challenge
BlueSoft consultants faced the challenge of a huge volume of data of unknown quality, distributed among multiple sources in an initially unknown manner (about 1.2 billion records to be processed in each project phase).

Knowledge about the source data's shape and quality was limited. For business analysts to access raw data, several checks and metrics had to be built at the very beginning of the design and analysis process.

The data had to be processed in a safe and efficient manner with extensive monitoring and troubleshooting capabilities, as well as process orchestration and reporting at every stage of the project. Complex data processing mechanisms (ETL), non-trivial data matching (fuzzy matching), data cleansing and validation, manual data stewardship, error reporting, and processing automation were all needed as well.

Choosing the proper tool to address this complex task was one of the most important success factors.

A tool with the following capabilities had to be leveraged:

- Processing billions of records / TBs of data
- Data security and redundancy
- Accessing multiple systems with different interface technologies
- Core ETL capabilities easy/quick to set up
- Out-of-the-box connectors (SFDC, SAP, file, MSSQL, etc.)
- Data processing orchestration, scheduling, logging, management
- Custom data manipulation using different programming languages
- Data matching including fuzzy matching algorithms
- Parallel processing and processing optimization
- Data reporting and cleansing

Main systems on Source Side: SAP ECC (ERP), SAP C4C (CRM), SmartSolve, Topaz, Marketo
Main systems on Target Side: Oracle ERP, SalesForce (CRM), Magic, Marketo, Hybris

Solution
To meet those requirements, Talend products supported by AWS RDS were chosen as the main ETL tool after a series of POC sprints. The high-level tech scope can be summarized as follows:

- Talend Cloud with Talend Remote Engine (main ETL and process orchestration tool)
- Amazon Web Services (RDS database servers in SaaS model)
- Microsoft SQL Server (main database technology)
- SharePoint (data reporting and supporting extracts)
- JIRA + Confluence (project implementation management, test management, and documentation)

At first, the BlueSoft team acquired core datasets from the source systems and prepared a set of detailed data quality reports in order to support the Company's business analysts in deciding which data was eligible for migration and in formulating detailed requirements.

The following challenges were addressed during the project’s lifetime.

Connectivity & throughput

Connectivity to multiple systems using different technical means has been established. The majority of the data from all the source systems had to be downloaded
in order to make data analysis, verification, and potential cleansing possible. Every stage of the project required this operation to be repeated due to a variety of
environments and data modifications. Billions of data records (TBytes) have been actively pulled from the sources multiple times and stored in RDS servers.
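Extraction at this scale is typically performed in bounded chunks so that memory use stays flat and a failed pull can be resumed. A minimal sketch of the idea (illustrative only, not the project's actual code; the paging function and chunk size are hypothetical):

```python
def fetch_in_chunks(fetch_page, chunk_size=1000):
    """Yield rows page by page; `fetch_page(offset, limit)` stands in for
    an offset/keyed query against a source system."""
    offset = 0
    while True:
        rows = fetch_page(offset, chunk_size)
        if not rows:
            break
        yield from rows
        offset += len(rows)

# Hypothetical in-memory "source system" for demonstration
SOURCE = list(range(2500))

def fetch_page(offset, limit):
    return SOURCE[offset:offset + limit]

print(sum(1 for _ in fetch_in_chunks(fetch_page)))  # -> 2500
```

Chunked pulls like this also make the repeated re-extractions mentioned above cheaper to restart, since each page is independent.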

Data matching

In addition to quite common data transformations, which most modern tools can handle, we've extensively utilized Talend's data matching components and a wide range of matching algorithms to achieve maximum data pairing and merging efficiency. In some cases, the Talend Data Stewardship module came in handy as well.
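As a rough illustration of the idea behind fuzzy matching (not Talend's actual algorithms; the record shape, field name, and threshold are assumptions), records can be paired by a normalized string-similarity score:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Normalized similarity between two strings, in the range 0.0-1.0."""
    return difflib.SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_records(source, target, threshold=0.85):
    """Pair each source record with its best-scoring target record,
    keeping only pairs above the similarity threshold."""
    pairs = []
    for src in source:
        best = max(target, key=lambda tgt: similarity(src["name"], tgt["name"]))
        score = similarity(src["name"], best["name"])
        if score >= threshold:
            pairs.append((src, best, round(score, 2)))
    return pairs

src = [{"name": "Acme Pharma GmbH"}, {"name": "Biotek Labs"}]
tgt = [{"name": "ACME Pharma GmbH."}, {"name": "Globex Corp"}]
print(match_records(src, tgt))
```

Here the trailing dot and case differences do not prevent a match, while "Biotek Labs" finds no counterpart above the threshold and is left for manual stewardship.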

Development process

In parallel to the general ETL process development and testing (multiple trial runs and test cycles), a wide range of error reports was prepared in order to identify and bring attention to the most important data problems. The processed data was sensitive in nature, falling under GxP and personal data processing regulations, which is why in the majority of cases data fixing had to be addressed by business users in the source/target systems instead of the middleware. Technical data cleansing, such as encoding changes, unwanted character removal, and data re-formatting, took place on the fly. Production data was being pulled, and dozens of error reports were recalculated and loaded into SharePoint on a weekly basis by Talend.
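On-the-fly technical cleansing of the kind described above (encoding normalization, unwanted character removal, re-formatting) can be sketched as follows; this is an illustrative example, not the project's code:

```python
import re
import unicodedata

def cleanse(value: str) -> str:
    """Illustrative field-level cleansing: normalize Unicode encoding,
    strip non-printable characters, collapse whitespace."""
    # Normalize Unicode so visually identical strings compare equal
    value = unicodedata.normalize("NFKC", value)
    # Drop control and other non-printable characters
    value = "".join(ch for ch in value if ch.isprintable())
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", value).strip()

# Non-breaking space, NUL byte, and stray whitespace are all repaired
print(cleanse("  Acme\u00a0Pharma\x00  GmbH\t"))  # -> "Acme Pharma GmbH"
```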

Thanks to the SaaS-based architecture of both Talend Cloud and AWS services, the migration team was able to seamlessly scale the solution, avoiding bottlenecks and improving efficiency while shortening migration execution time.

Platform scaling in the project:

- Initial 4 Talend Remote Engines scaled to 10
- 1 RDS medium server with 5 DBs scaled to 4 RDS large instances with more than 60 MSSQL DBs
- The team itself scaled from 5 to 21+ people

On top of scalability, which improves the performance and possibilities of the ETL process, significant emphasis was placed on parallel processing. Both leveraging multiple Talend Remote Engines and multi-threaded processing in the Talend code itself added to the solution's quality.

Moreover, the project team leveraged custom component creation capabilities to inject Python and Java code in order to optimize task execution even further.
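A simplified sketch of batch-level parallel processing of records (illustrative only; the `transform` step, worker count, and batch size are hypothetical, not the project's actual components):

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record: dict) -> dict:
    # Hypothetical per-record transformation (e.g., field re-formatting)
    return {**record, "name": record["name"].strip().upper()}

def process_in_parallel(records, workers=4, batch_size=2):
    """Split records into batches and transform them on a thread pool,
    preserving the original record order in the result."""
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda batch: [transform(r) for r in batch], batches)
    return [rec for batch in results for rec in batch]

data = [{"name": " acme "}, {"name": "globex"}, {"name": "initech "}]
print(process_in_parallel(data))
```

Partitioning into batches before fanning out keeps per-task overhead low, which is the same reasoning behind splitting work across multiple Remote Engines.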

The flexibility of database setup and replication on RDS, combined with easy connectivity and job execution orchestration on Talend Cloud, made data separation between environments straightforward. The main aim was to keep data tidy and independent through all the data migration testing and rehearsal phases.

Talend Components/Services used in the project:

- Talend Studio
- Talend Remote Engines
- Talend Management Console
- Talend Stewardship Module
- Talend Academy
- Talend Professionals

Results
As a result of the two-year project, a Talend-based solution was successfully implemented for this customer by a team including BlueSoft consultants and employees of the merged companies. More than 6 TB of data have been migrated via AWS & Talend Cloud, while more than 14 Talend-knowledgeable and certified engineers worked on the migration and analysis of a database of over a billion records.

Once this project is finalized, the client plans to leverage Talend to:

- Improve data quality and perform data cleansing in core systems
- Automate manual processes concerning verification across systems
- Build several system integrations
- Perform further data migrations of different scales

Key figures: 6 TB of cloud-stored data at the moment, 1.2 billion records, 10 Talend Remote Engines, 2 years of projects.

At BlueSoft, we don’t just code. We understand your business.

Reach out to us for collaboration with a trusted partner who supports your IT growth and your business goals.

Łukasz Bober 

Business Unit Director

lukasz.bober@bluesoft.com +48 603 911 131

www.bluesoft.com | Powered by BlueSoft
