
How to Build a CDP, DMP, and Data Lake for AdTech & MarTech


Introduction
Data platforms have been a key part of the programmatic advertising and
digital marketing industries for well over a decade. 

Platforms like customer data platforms (CDPs) and data management platforms
(DMPs) are crucial for helping advertisers and publishers run targeted
advertising campaigns, generate detailed analytics reports, perform attribution,
and better understand their audiences.

Another key component of data platforms is a data lake, which is a centralized
repository that allows you to store all your structured and unstructured data in
one place. The data collected by a data lake can then be passed to a CDP or DMP
and used to create audiences, among other things.

In this article, we’ll look at what CDPs, DMPs, and data lakes are, outline
situations where building them makes sense, and provide an overview of how to
build them based on our experience.

Table of Contents

Why Should You Build a CDP or DMP?
What Is a Customer Data Platform (CDP)?
What Is a Data Management Platform (DMP)?
What Is a Data Lake?
What’s the Difference Between a CDP, DMP, and a Data Lake?
Popular Use Cases of a CDP, DMP, and Data Lake
  CDP Use Cases
  DMP Use Cases
  Data Lake Use Cases
What Types of Data Do CDPs, DMPs, and Data Lakes Collect?
How Do CDPs, DMPs, and Data Lakes Collect This Data?
Common Technical Challenges and Requirements When Building a DMP or CDP
Common Technical Challenges and Requirements When Building a Data Lake
An Example of How to Build a CDP, DMP, and Data Lake
  Assumptions
  The Main Features
  An Example of an AWS Architecture Setup for a CDP/DMP with a Data Lake
  Request Flow
  AWS Usage
Cost-Level Analysis of Different AWS Components
Important Considerations
Looking at Building a CDP, DMP or Data Lake?
About Clearcode
Contact Us

Why Should You Build a CDP or DMP?
Although there are many CDPs and DMPs on the market, many companies require their own
solution to provide them with control over the collected data, intellectual property, and
feature roadmap.

Here are a couple of situations where building a CDP or DMP makes sense:

1. If you’re an AdTech or MarTech company and want to expand or improve your
tech offering.

2. If you’re a publisher and want to build a walled garden to monetize your first-party data
and allow advertisers to target your audiences.

3. If you’re a company that collects large amounts of data from multiple sources and
you want ownership of the tech and control over the product and feature roadmap.

What Is a Customer Data Platform (CDP)?
A customer data platform (CDP) is a piece of marketing technology that collects and
organizes data from a range of online and offline sources.

CDPs are typically used by marketers to collect all the available data about the customer and
aggregate it into a single database, which is integrated with and accessible from a number of
other marketing systems and platforms used by the company.

With a CDP, marketers can view detailed analytics reports, create user profiles, audiences,
segments, and single customer views, as well as improve advertising and marketing
campaigns by exporting the data to other systems. 

View our infographic below to learn more about the key components of a CDP:

What Is a Data Management Platform
(DMP)?
A data management platform (DMP) is a piece of software that collects, stores, and organizes
data collected from a range of sources, such as websites, mobile apps, and advertising
campaigns. Advertisers, agencies, and publishers use a DMP to improve ad targeting,
conduct advanced analytics, look-alike modeling, and audience extension.

View our infographic below to learn more about the key components of a DMP:

What Is a Data Lake?
A data lake is a centralized repository that stores structured, semi-structured and
unstructured data, usually in large amounts. Data lakes are often used as a single source of
truth. This means that the data is prepared and stored in a way that ensures it’s correct and
validated. A data lake is also a universal source of normalized, deduplicated, aggregated data
that is used across an entire company and often includes user-access controls. 

Structured data: Data that has been formatted using a schema. Structured data is easily
searchable in relational databases.

Semi-structured data: Data that doesn’t conform to the tabular structure of databases,
but contains organizational properties that allow it to be analyzed.

Unstructured data: Data that hasn’t been formatted and is in its original state.

Structured data:
• Databases

Semi-structured or flat data:
• Logs, CSV, XML, and JSON data
• Emails
• Web pages

Unstructured and binary data:
• Audio
• Video
• Image data
• Natural language
• PDFs
• Documents

Many companies have data science departments or products (like a CDP) that collect data
from different sources, but they require a common source of data. Data collected from these
different data sources often requires additional processing before it can be used for
programmatic advertising or data analysis.

Generally, unaltered or raw-stage data (also known as bronze data) is also available. With this
data-copying approach, we are able to perform additional data verification steps on sampled
or full data sets. The raw stage is also helpful if, for some reason, we need to reprocess
historical data that wasn’t fully transformed.

What’s the Difference Between a CDP, DMP,
and a Data Lake? 
CDPs may seem very similar to DMPs, as both are responsible for collecting and storing
data about customers. There are, however, certain differences in the way they work.

CDPs primarily use first-party data and are based on real consumer identities generated by
collecting and using personally identifiable information (PII). The information comes from
various systems in the organization and can be enriched with third-party data. CDPs are
mainly used by marketers to nurture the existing consumer base.

DMPs, on the other hand, are primarily responsible for aggregating third-party data, which
typically involves the use of cookies. In this way, a DMP is more of an AdTech platform, while
a CDP can be considered a MarTech tool. DMPs are mainly used to enhance advertising
campaigns and acquire lookalike audiences.

A data lake is essentially a system that feeds data into a CDP or DMP.

See the comparison below for an overview of the differences between a
CDP, a DMP, and a data lake.

CDPs:
• Focused on marketing (communicating to a known audience).
• A CDP typically leverages first-party data, but can be enriched with third-party data.
• CDPs primarily use PII and first-party data.

DMPs:
• Focused on advertising (communicating to unknown audiences).
• A DMP typically leverages third-party data, with first-party data acting as an additional source of information.
• DMPs have traditionally used non-PII data, such as cookie IDs and device IDs.

Data lake:
• A centralized repository used for storing large amounts of structured and unstructured data, which is often pushed to a CDP or DMP and used for creating user profiles and audiences.
• The data in a data lake can consist of first-party, second-party, and third-party data.
Popular Use Cases of a CDP, DMP, and Data
Lake
CDP Use Cases:
• Audience creation and segmentation.
• Creating a single customer view (SCV).
• ID management (e.g. ID resolution and ID graphs).
• Predictive analytics.
• Content and product recommendations.

DMP Use Cases:
• Audience creation and segmentation.
• Audience targeting.
• Retargeting.
• Lookalike modeling.
• ID management (e.g. ID resolution and ID graphs).
• Audience intelligence.
• Audience extension.

Data Lake Use Cases:
• Data collection: Structured and unstructured data collection from multiple sources.
• Data integrations: Makes it easier to integrate new data sources.
• Analysis: Real-time analysis and reports.
• Data operations: Polling and processing.
• Security: Access control limited to authorized persons.
• Analytics: Provides the possibility of running analyses without the need for data transfers.
• Cataloging and indexing: Provides easy-to-understand content via cataloging and indexing.

What Types of Data Do CDPs, DMPs, and
Data Lakes Collect? 
The types of data CDPs, DMPs, and data lakes collect include:

First-Party Data
First-party data is information gathered straight from a user or customer. It’s
considered the most valuable form of data because the advertiser or publisher has a direct
relationship with the user (e.g. the user has already engaged and interacted with the
advertiser).

First-party data is typically collected from:

• Web and mobile analytics tools.


• Customer relationship management (CRM) systems.
• Transactional systems.
Second-Party Data
Second-party data is essentially first-party data from a different company, and it’s much less
common than first- or even third-party data. The information is initially collected in the form
of first-party data and then passed on to another advertiser through a partnership agreement,
at which point it becomes second-party data.

Third-Party Data
Third-party data has long been used to help advertisers reach and target their desired
audiences. However, due to ever-increasing privacy changes, the availability and usage of
third-party data has been decreasing over the years.

Many publishers and merchants monetize their data by adding third-party trackers to their
websites or tracking SDKs to their apps and passing data about their audiences to data
brokers and DMPs. 

This data can include a user’s browsing history, content interaction, purchases, profile
information entered by the user (e.g. gender or age), GPS geolocation, and much more. 

Based on these data sets, data brokers can create inferred data points about interests,
purchase preferences, income groups, demographics and more.

The data can be further enriched from offline data providers, such as credit card companies,
credit scoring agencies and telcos.

How Do CDPs, DMPs, and Data Lakes Collect This Data?
The most common ways for CDPs, DMPs, and data lakes to collect data are by: 

• Integrating with other AdTech and MarTech platforms via a server-to-server connection
or API.

• Adding a tag (aka JavaScript snippet or HTML pixel) to an advertiser’s or publisher’s
website.

• Importing data from files, e.g. CSV, TSV, and Parquet (see the sketch below).
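
As a rough illustration of the file-based approach, the sketch below converts a CSV export to Parquet with light cleanup. It assumes pandas and pyarrow are installed; the file names and columns are hypothetical examples, not a prescribed schema.

```python
import pandas as pd

# Read a raw CSV export (a TSV file would use sep="\t").
events = pd.read_csv("events_export.csv")

# Light normalization before loading the file into the platform.
events["ts"] = pd.to_datetime(events["ts"], utc=True)
events = events.drop_duplicates(subset=["user_id", "event", "ts"])

# Parquet is columnar and compressed, which keeps storage and scan costs low.
events.to_parquet("events_export.parquet", index=False)
```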

Common Technical Challenges and
Requirements When Building a DMP or CDP 
Both CDP and DMP infrastructures are intended to process large amounts of data; the
more data the CDP or DMP can use to build segments, the more valuable it is for its users
(e.g. advertisers, data scientists, publishers, etc.).

However, the larger the scale of data collection, the more complex the infrastructure setup
will be. 

For this reason, we first need to properly assess the scale and amount of data that needs to
be processed as the infrastructure design will be dependent on many different requirements. 

Below are some key requirements that should be taken into account when building a CDP or
DMP:

Data-Source Stream 
A data-source stream is responsible for obtaining data from users/visitors. This data has to
be collected and sent to a tracking server. 

Data sources include:


• Website data. JavaScript code on a website is used to check for browser events. If an
action is taken by a visitor, the JS code creates a payload and sends it to the
tracker component.

• Mobile application data. This often involves using an SDK, which can collect first-party
application data. This data may include user identification data, profile attributes, as
well as user behavior data. User behavior events cover specific actions inside mobile
apps. Data sent from an SDK is collected by the tracker component (a minimal tracker
sketch follows this list).
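
To make the flow concrete, here is a minimal sketch of a tracker endpoint that accepts payloads from a JS tag or a mobile SDK. It uses Flask purely as an example; the route, field names, and validation rules are assumptions rather than a prescribed design.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route("/track", methods=["POST"])
def track():
    payload = request.get_json(force=True)

    # Minimal validation before accepting the event.
    if "user_id" not in payload or "event" not in payload:
        return jsonify({"error": "user_id and event are required"}), 400

    event = {
        "user_id": payload["user_id"],           # e.g. a cookie or device ID
        "event": payload["event"],               # e.g. "page_view", "click"
        "properties": payload.get("properties", {}),
        "source": payload.get("source", "web"),  # "web" or "mobile_sdk"
    }

    # A production tracker would push the event to a queue or stream
    # (e.g. Kinesis or Kafka) rather than processing it inline.
    print("accepted event:", event)
    return jsonify({"status": "ok"}), 200


if __name__ == "__main__":
    app.run(port=8080)
```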

Data Integration 
There are multiple data sources that can be incorporated into a CDP’s or DMP’s
infrastructure:

• First-party data integration. This includes data collected by a tracker and data from
other platforms.

• Second-party data integration. Data collected via integrations with data vendors (e.g.
credit reporting companies), which can be used to enrich profile information.

• Third-party data integration. Typically via third-party trackers, e.g. pixels and scripts on
websites and SDKs in mobile apps.

The Number of Profiles 


Knowing the number of profiles that will be stored in a CDP or DMP is crucial in determining
the database type for profile storage. 

Seeing as the profile database is responsible for identity resolution (which plays a key role in
profile merging) and for proper segment assignment, it is a key component of the CDP’s or
DMP’s infrastructure. A simplified identity-resolution sketch is shown below.
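
One common way to implement the profile-merging side of identity resolution is a union-find structure over observed identifiers: IDs that appear together in the same event are merged into one profile. This is a simplified illustration; the identifier formats are hypothetical, and a production ID graph would typically live in a graph database such as Amazon Neptune.

```python
# Identifiers observed together are merged into a single profile.
parent = {}


def find(x):
    """Return the root identifier for x, with path compression."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(a, b):
    """Link two identifiers that were seen in the same event."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


# Each pair represents IDs observed together in a single event.
observations = [
    ("cookie:abc", "email:hash1"),    # a login links a cookie to an email hash
    ("cookie:abc", "device:ios-42"),  # an app event links the same user's device
    ("cookie:xyz", "email:hash2"),    # a different user
]

for a, b in observations:
    union(a, b)

# Group all identifiers by their root to get the merged profiles.
profiles = {}
for identifier in list(parent):
    profiles.setdefault(find(identifier), []).append(identifier)

print(profiles)
# {'cookie:abc': ['cookie:abc', 'email:hash1', 'device:ios-42'],
#  'cookie:xyz': ['cookie:xyz', 'email:hash2']}
```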

Data Extraction and Discovery


One common use case of a CDP and DMP is to provide an interface for data scientists so they
have a common source of normalized data. 

The cleaned and deduplicated data source is a very valuable input that can be used to
additionally prepare data for machine-learning purposes. This kind of data preparation often
requires you to create a data lake, where data is transformed and encoded to a form that can
be understood by machines. 

There are many types of data transformations, such as: 

• OneHotEncoder
• Hashing
• LeaveOneOut
• Target
• Ordinal (Integer)
• Binary 
Selecting a suitable data transformation type and designing a good data pipeline for
machine learning involves collaboration between the development team and data scientists,
who analyze the data and provide valuable input regarding the machine-learning
requirements (see the encoder sketch below).
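
The transformation types listed above map onto encoders available in common Python libraries. The sketch below uses the category_encoders package (an assumption on our part; scikit-learn provides similar encoders) on a hypothetical profile data set.

```python
import pandas as pd
import category_encoders as ce

profiles = pd.DataFrame({
    "device_type": ["mobile", "desktop", "mobile", "tablet"],
    "country": ["US", "PL", "US", "DE"],
})
converted = pd.Series([1, 0, 0, 1])  # hypothetical conversion label

# Unsupervised encoders only need the feature columns.
onehot = ce.OneHotEncoder(cols=["device_type"]).fit_transform(profiles)
ordinal = ce.OrdinalEncoder(cols=["device_type"]).fit_transform(profiles)
binary = ce.BinaryEncoder(cols=["country"]).fit_transform(profiles)
hashed = ce.HashingEncoder(cols=["country"], n_components=8).fit_transform(profiles)

# Supervised encoders (target, leave-one-out) also need the label column.
target = ce.TargetEncoder(cols=["country"]).fit_transform(profiles, converted)
loo = ce.LeaveOneOutEncoder(cols=["country"]).fit_transform(profiles, converted)

print(target.head())
```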

Additionally, machine learning may be used to create event-prediction models, run
clustering and classification jobs, and aggregate and transform data. This can lead to
discovering patterns that may initially be invisible to the human eye, but become quite
obvious after applying a transformation (e.g. a hyperplane transformation).

Segments
The types of segments that need to be supported by a CDP’s or DMP’s infrastructure also
influence the infrastructure’s design.

Supported segment types include the following (a rule-evaluation sketch follows the list):

• Attribute-based segments (demographic data, location, device type, etc).


• Behavioral segments based on events (e.g. clicking on a link in an email) and the
frequency of those actions (e.g. visiting a web page at least three times a month).

• Segments based on classification performed by machine learning:


• Lookalike / affinity: The goal of lookalike/affinity modeling is to support audience
extension. Audience extension can be based on a variety of inputs and be driven by
similarity functions. In the end, you can imagine a self-improving loop: we pick
profiles with a lot of conversions and create affinity audiences, which results in an
audience with more conversions, which can in turn be used to create more affinity
profiles, and so on.

• Predictive: The goal of predictive targeting is to use available information to predict
the likelihood of an interesting event (purchase, app installation, etc.) and to target
only the profiles with a high predicted likelihood.
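
As a simplified illustration of the first two segment types, the sketch below evaluates an attribute-based rule and a frequency-based behavioral rule against a single profile. The profile structure, thresholds, and reference date are hypothetical.

```python
from datetime import datetime, timedelta

# A hypothetical profile with attributes and timestamped events.
profile = {
    "attributes": {"country": "US", "device_type": "mobile"},
    "events": [
        {"name": "page_view", "ts": datetime(2024, 5, 1)},
        {"name": "page_view", "ts": datetime(2024, 5, 9)},
        {"name": "page_view", "ts": datetime(2024, 5, 20)},
        {"name": "email_click", "ts": datetime(2024, 5, 21)},
    ],
}

NOW = datetime(2024, 5, 31)  # fixed reference date for the example


def in_attribute_segment(p, country):
    """Attribute-based rule, e.g. all profiles located in a given country."""
    return p["attributes"].get("country") == country


def in_frequency_segment(p, event_name, min_count, window_days):
    """Behavioral rule, e.g. at least three page views in the last 30 days."""
    cutoff = NOW - timedelta(days=window_days)
    hits = [e for e in p["events"]
            if e["name"] == event_name and e["ts"] >= cutoff]
    return len(hits) >= min_count


segments = []
if in_attribute_segment(profile, "US"):
    segments.append("us_visitors")
if in_frequency_segment(profile, "page_view", min_count=3, window_days=30):
    segments.append("frequent_visitors")

print(segments)  # ['us_visitors', 'frequent_visitors']
```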

Common Technical Challenges and Requirements When Building a Data Lake
Below are some common challenges when building a data lake:

• It’s difficult to combine multiple data sources to generate useful insights and
actionable data. Usually, IDs are required to bind the different data sources
together, but often these IDs are not present or simply don’t match.

• It’s often hard to know what data is included in a given data source. Sometimes the data
owner doesn’t even know what kind of data is there. 

• There is also a need to clean up the data and reprocess it in case of an ETL pipeline
failure, which will happen from time to time. This needs to be done either manually or
automatically. Databricks Delta Lake has an automatic solution since their delta tables
comply with ACID properties. AWS is also implementing ACID transactions in one of
their solutions (governed tables), but it’s only available in one region at the moment.

In the first step of processing, data is extracted and loaded into the first raw stage. After the
first stage, multiple data lake stages are often available, depending on the use case. 

Usually, the second step carries out various data transformations, like deduplication,
normalization, column prioritization, and merging. The following steps perform additional
layers of data transformations, for example, business-level aggregations required by the data
science team or for reporting purposes (see the PySpark sketch below).
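
A minimal PySpark sketch of this raw-to-cleaned-to-aggregated flow might look like the following. The S3 paths and column names are hypothetical, and a real pipeline would add schema enforcement and error handling.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-cleaned").getOrCreate()

# Step 1: load the raw (bronze) stage as it was ingested.
raw = spark.read.parquet("s3://example-data-lake/raw/events/")

# Step 2: deduplicate and normalize.
cleaned = (
    raw.dropDuplicates(["event_id"])                          # deduplication
       .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalization
       .withColumn("event_date", F.to_date(F.col("event_ts")))
)

# Step 3: business-level aggregation for reporting or data science.
daily_counts = cleaned.groupBy("event_date", "event_name").count()

cleaned.write.mode("overwrite").parquet(
    "s3://example-data-lake/cleaned/events/")
daily_counts.write.mode("overwrite").parquet(
    "s3://example-data-lake/aggregated/daily_counts/")
```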

By incorporating data lake components from AWS, such as AWS Lake Formation, which
builds on the well-known S3 storage mechanism, with AWS Glue or Amazon EMR for the ETL
data pipelines, we are able to create a centralized, curated, and secured data
repository.

On top of AWS Lake Formation, Amazon Athena provides a common interface that can be
shared between multiple infrastructure components and offers a unified way to access the
data managed by Lake Formation (a query sketch follows below).
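
For example, a component could query the curated data through Athena with boto3, along these lines (the region, database, table, and result bucket are hypothetical):

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT event_name, COUNT(*) AS events "
                "FROM events_cleaned GROUP BY event_name",
    QueryExecutionContext={"Database": "example_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes (a real system would add backoff and timeouts).
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(
        QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)
```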

By using IAM security methods, a further layer of proper access-level
controls can be added to the data lake.

If the data lake is properly designed and created, access to the data can be optimized for
cost.

Also, thanks to the final aggregation level, the required operations can be performed just
once during the ETL pipeline, rather than each time the data is used.

An Example of How to Build a CDP, DMP, and Data Lake
Below is an example of how to build a CDP/DMP and a data lake. In this example, we’ll
treat the CDP and DMP as the same thing (i.e. CDP/DMP).

The information below is based on CDPs, DMPs, and data lakes that we’ve built in the past. It’s
important to note that the features, architecture setup, and costs will differ for each
project depending on the requirements and the scale of the data that needs to be collected.

We’ve used AWS for the infrastructure, but the approach also applies to other infrastructure
providers like Azure and Google Cloud Platform.

Assumptions
We’ve assumed that the data lake will collect first-, second-, and third-party data from
different sources, such as websites, mobile devices, and other AdTech and MarTech
platforms. The data lake will then make this data available to the CDP/DMP. 

The Main Features


• Workloads: The CDP/DMP should include workloads like a data pipeline, identity
resolution, graph mapping, and a profile and segment database.

• Profiles: Users (e.g. advertisers, data scientists, publishers, etc.) can build user profiles
to create audiences.

• ID resolution and an ID graph: The CDP/DMP should include an identity resolution
service that creates an ID graph and single customer view (SCV) based on data
collected from multiple sources.

• Integrations: The CDP/DMP should be able to integrate with other AdTech and
MarTech platforms so the data can be activated, i.e. used for different purposes such as
ad targeting and product recommendations.

An Example of an AWS Architecture Setup for a CDP/DMP with a Data Lake

[Architecture diagram: collecting and transforming the data from different sources for a
CDP/DMP using a data lake.]

Request Flow
Every CDP/DMP system has the following stages:

• Data input: Data coming from external sources (first-party, second-party, and
third-party data) and data from trackers (i.e. components that collect data). This data
will generally be stored in a data lake.

• Data processing: Creating taxonomies and segments, and assigning profiles to audiences
according to the rules defined by CDP/DMP users.

• Data output: In most cases, the data output from a CDP/DMP system involves an
audience activation process on external DSPs or ad servers. Sometimes, the audience
data can also be sent directly to other CDP/DMP systems; however, user matching
has to be performed first (see the sketch below).
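
A data-output step might look roughly like the sketch below, which posts a matched audience to an external platform over a server-to-server API. The endpoint, authentication scheme, and payload shape are hypothetical, since each DSP or ad server defines its own activation API.

```python
import requests

audience = {
    "audience_id": "aud-123",
    "name": "frequent_visitors",
    # These IDs would already be matched to the destination platform's
    # ID space via a user-matching step.
    "user_ids": ["dsp-uid-001", "dsp-uid-002"],
}

response = requests.post(
    "https://dsp.example.com/v1/audiences",  # hypothetical endpoint
    json=audience,
    headers={"Authorization": "Bearer <api-token>"},
    timeout=10,
)
response.raise_for_status()
print("audience activated:", response.json())
```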

AWS Usage
Most CDP/DMP and data lake infrastructure requirements can be covered by existing AWS
components:

• Data lake: AWS Lake Formation.

• Data onboarding: The Blueprints functionality from AWS Lake Formation.

• ETL between data lake stages: A combination of AWS Glue and Amazon EMR. By
using Amazon EMR, we can utilize spot instances to help lower costs.

• Graph profile database: Amazon Neptune.

• Other profile and cache databases: Amazon Elasticsearch Service and Amazon DynamoDB.

By using AWS components, we can save a lot of development and maintenance time.

Two key services used by CDPs/DMPs to support large amounts of data are AWS Glue and
Amazon Neptune.

Without these components, we would need to maintain our own Spark cluster for data
processing and a graph database (like Neo4j), which would mean adding maintenance
resources to the project.

Cost-Level Analysis of Different AWS
Components
As we mentioned earlier, the infrastructure is dependent on multiple requirements. The cost
of the infrastructure will vary depending on the requirements and the design itself. 

Taking into account our experience and calculations, the cost of the infrastructure can range
from $5,000 to $40,000.

For example, if we use Amazon Neptune as a profile database, then we’ll need to add around
$2,000-3,000 per month, though that estimate assumes a single instance.

If the CDP/DMP infrastructure requires a tracker, then an additional instance or cluster will
be required. The cost will depend on the traffic requirements.

A data lake requires multiple ETL pipelines to get the data into our infrastructure and
transform it accordingly. AWS Glue comes in handy for this purpose.

The cost of the ETL processes will vary depending on the amount of data that has to be
processed and the required frequency, which directly influences the number of AWS Glue
jobs and their duration. Current pricing can be found on the AWS Glue pricing page.

Depending on the amount of data that needs to be extracted, transformed, and loaded, a
pipeline run can take from a few minutes to a few hours, and can even take a few days for the
first data dump.

To save costs, Blueprints, a feature of AWS Lake Formation, can be used together with AWS
Glue job bookmarks. This requires the incoming data to have a defined field that
monotonically increases or decreases when records change.

One downside is that you need to create Blueprints services manually as there is no API that
you can use to create one automatically. In previous projects, we’ve used a PySpark script
from the original Blueprint service, but we had to write the code for deploying the Glue
workflow ourselves. 

The decision to use Blueprints will depend on whether the source is JDBC-based or an S3
source. For JDBC-based sources, we recommend using an additional field, such as updated_at,
which contains an explicit timestamp of the last modification.

This field is set as the bookmark key and is watched for changes. This way, subsequent data
pipeline runs take less time, as they only need to process newly updated data (a Glue job
sketch follows below).
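
A minimal AWS Glue job script using this bookmark pattern might look like the following. The database, table, and S3 path are hypothetical, and job bookmarks must also be enabled in the job’s settings for transformation_ctx to take effect.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The updated_at column acts as the bookmark key, so each run only reads
# rows modified since the previous run.
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_crm_db",
    table_name="customers",
    transformation_ctx="source",  # required for bookmarks to track state
    additional_options={
        "jobBookmarkKeys": ["updated_at"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/raw/customers/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the bookmark state for the next run
```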

The cost of the data lake will depend on the amount of data stored, with Amazon S3 buckets
being a key component of the data lake. 

The data is compressed and, in most cases, an optimal storage format is chosen (e.g.
Parquet). S3 Intelligent-Tiering can be used to cut down costs if we know that the data
will not be frequently accessed after some period of time (a lifecycle-rule sketch follows
below). Current pricing for S3 data storage is available on the Amazon S3 pricing page.
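
For illustration, a lifecycle rule that transitions older objects to S3 Intelligent-Tiering can be set with boto3 along these lines (the bucket name, prefix, and 30-day threshold are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-events-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/events/"},
                # Move objects to Intelligent-Tiering 30 days after creation.
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```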

Important Considerations
• Be sure to choose the right SLA and guarantees for the different Amazon Web Services
and understand how many events you could lose over a certain period of time in the
event of a failure. 

• Consider the fault tolerance of the components. This is important when setting up an
AWS solution, yet not many architects or engineers account for it.

• Take advantage of AWS custom metrics to track issues between AWS components.

• Be aware of the pricing model of each service, because it’s easy to spend a lot of
money on a service that isn’t used properly.

• Data onboarding: The Blueprints functionality from AWS Lake Formation may not
support all data source types. In this situation, a custom ETL process has to be created.

Trusted AdTech and MarTech Development Partner

About Clearcode
Clearcode is a full-service software development company that specializes in
designing, building, and maintaining custom advertising and marketing
technology.

Since 2009, we’ve been partnering with tech companies to develop RTB,
programmatic, data management, and analytics platforms for all advertising
channels — web, in-app mobile, CTV/OTT, and DOOH.

Clients partner with us because of our experience, domain expertise, and
knowledge of the inner workings of the programmatic advertising and digital
marketing ecosystems.

Looking at building a CDP, DMP or data lake?
Get in touch with us via one of the channels below.

Email: sales@clearcode.cc
Phone (US): (800) 615-0584
Phone (Europe): +48 71 881 766
Web: clearcode.cc

