DataFlux


It's been a while since my last post, so I thought I'd share something on DataFlux.

So what is DataFlux? A leader in data quality, it's both a company and a product; better stated,
DataFlux (the company) provides a suite of tools (often simply called DataFlux) that provides data
management capabilities, with a focus on data quality.
DataFlux's tools can do a lot of really neat things; I'd say it's a must-have for Sales & Marketing, and
it'd benefit most enterprises out there in other ways. To see what all the pomp is about, let's use
an example. Think of these entries in your company's xyz system:
Name                  | Address                     | City, State, Zip            | Phone
----------------------|-----------------------------|-----------------------------|---------------
Mr. Victor Johnson    | 1600 Pennsylvania Avenue NW | Washington, DC 20500        | 202-456-1414
Victor Jonson, JD     | 1600 Pennsylvania Avenue    | Washington, DC              | 456-1414
VICTOR JOHNSON        | 255 DONNA WAY               | SAN LUIS OBISPO, CA 93405   | (805) 555-1212
Bill Shares           | 1050 Monterey St            | SLO, CA 93408               | 8055444800
Doctor William Shares | 1052 Monterrey Ave          | San Luis Obispo, California | n/a
william shares, sr    | 1001 cass street            | omaha, nebraska, 68102      |

In this example, a human could pretty easily figure out that the first two Victors are probably
one and the same, and that Bill in SLO and William in San Luis Obispo are also the same person. The
other records might be a match, but most of us would agree that we can't be sure based on name
alone. Furthermore, it is obvious that some data inconsistencies exist, such as name prefixes and
suffixes, inconsistent casing, incomplete address data, etc.; DataFlux can't (and shouldn't try to) fix all
of these quirks, but it should at least be able to reconcile the differences, and, if we choose, we should
be able to do some data cleanup automatically. So let's get started. I'll open up dfPower Studio.

This interface is new in version 8 and helps provide quick access to the functions one would use most
often. This change is actually helpful (as opposed to some GUI changes made by companies),
combining a lot of the settings into a central place.
In my case, I'll start Architect by clicking on the icon in the top left; Architect is where most design
takes place. On this note I guess I should say that Architect is the single most useful product in the
suite (in my opinion, anyway), and it's where I'll spend most of my time in this posting.

On the left panel you'll see a few categories. Let me explain what you'll find in each one (skip over this
next section if you want):
Data Inputs: Here you'll find nodes allowing you to read from ODBC sources, text files, SAS data
sets (DataFlux is a SAS company), and more. I'll cover one other data input later.
Data Outputs: Similar to inputs, you'll find various ways of storing the output of the job.
Utilities: These contain what many would refer to as transformations, which might be helpful to
know if you've worked with Informatica or another ETL (Extract, Transform, Load) tool.
Profiling: Most nodes here provide a synopsis of the data being processed. Another DataFlux
tool is dedicated to profiling; in some ways these nodes are a subset of that tool's functionality, but
with one primary difference: here the output of profiling can be linked to other actions.
Quality: Here's where some of DataFlux's real magic takes place, so I'll go through the task of
describing each node briefly: Gender Analysis (determine gender based on a name field), Identification
Analysis (e.g., is this a person's name or an organization's name?), Parsing (we'll see this),
Standardization (we'll see one application of this), Change Case (although generally not too
complicated, this gets tricky with certain alphabets), Right Fielding (move data from the wrong field
to the right one), Create Scheme (new in version 8; more of an advanced topic), and Dynamic Scheme
Application (new in version 8; another advanced topic).
Integration: Another area where magic takes place. We'll see this in this post.
Enrichment: As the name suggests, these nodes help enrich data, i.e., they provide data that's
missing in the input. This section includes address verification (we'll see this), geocoding (obtaining
demographic and other information based on an address), and some phone number functions (we'll
see one example).
Enrichment (Distributed): Provides the same functionality as just described, but distributed
across servers for performance/reliability gains.
Monitoring: Allows action to take place on a data trigger, e.g., email John if sales fall under
$10K (a toy sketch of this idea follows the list).
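To make the trigger idea concrete, here's a minimal sketch in Python; the record layout, rule, and alert action are all hypothetical illustrations, not DataFlux's API:

```python
# Toy sketch of a monitoring trigger: evaluate a rule per record and
# fire an action when it trips. All names here are invented.
def monitor(records, rule, action):
    for record in records:
        if rule(record):
            action(record)

monitor(
    [{"region": "West", "sales": 8500}, {"region": "East", "sales": 12000}],
    rule=lambda r: r["sales"] < 10_000,
    action=lambda r: print(f"ALERT: sales fell under $10K in {r['region']}"),
)
```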
Now that we've gone through a quick overview of Architect's features, let's use them. I'll first drag my
data source onto the page and double-click on it to configure its properties. For my purposes today I'll
read from a delimited text file I created with the data I described at the beginning of the article. I can
use the Suggest button to populate the field names based on the header of the text file.

What's nice here is I can have auto-preview on (which, by the way, drives me crazy), or I can turn it off
and press F5 for a refresh, which shows the data only when asked. Either way, the data will appear in
my preview window (instant gratification is one of the great things about Architect).

Next, I'll start my data quality work today by verifying these addresses. I do this by dragging on the
Address Verification (US/Canada) node. After attaching the node to Text File Input 1 and double-
clicking on the node, in the input section I map my fields to the ones expected by DataFlux, and in
another window I specify which outputs I'm interested in. I've selected a few fields here, but there are
many other options available.


You'll notice here I've passed through only the enriched address fields in the output. I could have also
kept the originals side by side, plus I could have added many more fields to the output, but these will
suffice for now (it'd be tough to fit them all on the screen here). Already you can see what a difference
we've made. I want to point out just two things here:
1. There is one NOMATCH. This likely happened because too many fields were wrong, and
the USPS data verification system is designed not to guess too much.
2. 1052 Monterey St is an address I made up, and consequently the ZIP+4 could not be determined.
The real address for the courthouse in San Luis Obispo is 1050 Monterey St. If I had used that,
the correct ZIP+4 would have been calculated. So why did we get a US_Result_Code of OK? Because
the USPS system recognizes 1052 as an address within a valid range.
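To illustrate point 2, here's a toy sketch of range-based validation; the range record and logic are invented for illustration and are far cruder than the real CASS-certified checks:

```python
# Toy illustration of why 1052 validated as OK: USPS-style data stores
# delivery ranges per street, so any house number inside a known range
# passes validation even if no mail is actually delivered there.
RANGES = [
    # Hypothetical range record for Monterey St in San Luis Obispo
    {"street": "MONTEREY ST", "low": 1000, "high": 1098, "even_only": True},
]

def check(house: int, street: str) -> str:
    for r in RANGES:
        if (street == r["street"] and r["low"] <= house <= r["high"]
                and (not r["even_only"] or house % 2 == 0)):
            return "OK"  # in range, even for a made-up house number
    return "NOMATCH"

print(check(1052, "MONTEREY ST"))  # OK, but no exact ZIP+4 delivery point
print(check(9999, "MONTEREY ST"))  # NOMATCH
```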
Nonetheless, pretty neat, eh? I'd also like to point out that the county name was determined because
I added this output when I configured the properties. At our company we've configured DataFlux to
comply with USPS Publication 28, which, among other things, indicates that addresses should always
be uppercased; that's why you see that here. Having said this, you have the option to propercase
the result set if you'd like.
Moving on, let's clean up the names. It'd be nice if we could split the names into a first and last name.
First, I reconfigured the USPS properties to allow additional outputs (the original name and phone
number). Next, I dragged the Parsing node onto the screen and configured its properties to identify
what language and country the text was based on (DataFlux supports several locales and in version 8
supports Unicode). After that, I can preview as before. Note how well DataFlux picked out the first,
middle, and last names, not to mention the prefixes and suffixes.

For simplicity, I'll remove the Parse step I just added and use a Standardize node instead. Here in the
properties I'll select a Definition for the name and phone inputs. There are many options to choose
from, including things like Address, Business Title, City, Country, Name, Date, Organization, Phone,
Postal Code, Zip, and several others. Let's see what this does.

You might be wondering how DataFlux does this. After all, if the input name were "Johnson, Victor",
would it have correctly standardized the name to "Victor Johnson"? The answer here is yes. DataFlux
utilizes several algorithms and known last names, first names, etc. to analyze the structure and
provide a best guess. Of course this means that with very unusual names the parsing algorithm
could make a mistake; nonetheless, I think most users would be surprised how good this
guessing can be, especially with the help of a comma. By that I mean that the placement of a
comma in a name greatly enhances the parser's ability to determine the location of the last name. If
you're interested in learning more about this, let me know and perhaps I'll write another blog post to go
into the details. All in all, it's pretty neat stuff, and of course the good part is that it's customizable.
This helps if someday you want to write a standardization rule for your company's specific purpose.
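Here's a rough sketch of the comma idea; this is my own toy logic, not DataFlux's parser, and the suffix list is deliberately tiny:

```python
# Toy name parser: a comma strongly suggests "Last, First" order.
KNOWN_SUFFIXES = {"JR", "SR", "JD", "MD", "III"}

def parse_name(raw: str) -> dict:
    tokens = [t.strip() for t in raw.replace(".", "").split(",")]
    if len(tokens) > 1 and tokens[-1].upper() in KNOWN_SUFFIXES:
        suffix = tokens.pop().upper()     # trailing ", JD" or ", Sr"
    else:
        suffix = ""
    if len(tokens) == 2:                  # "Johnson, Victor"
        family, given = tokens
    else:                                 # "Victor Johnson"
        parts = tokens[0].split()
        given, family = " ".join(parts[:-1]), parts[-1]
    return {"given": given, "family": family, "suffix": suffix}

print(parse_name("Johnson, Victor"))    # given='Victor', family='Johnson'
print(parse_name("Victor Jonson, JD"))  # suffix='JD' captured
```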
Let's move on. I'm next going to make match codes. Match codes allow duplicate identification (and
resolution). Oftentimes (perhaps most of the time) nothing can be done about data in a
system once it is entered: if a name is Rob, we can't assume the real name is Robert, yet
we may have a burning desire to do something like that to figure out that one record is a potential
duplicate of another. This is where match codes come in. Here's the section of the Match Codes
Properties window where we assign the incoming fields to the Definition. This step is important
because intelligent parsing, name lookups, etc. occur based on the data type.
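Before we preview, here's a rough illustration of the concept (not DataFlux's actual algorithm, which draws on the Quality Knowledge Base): a match code boils a value down to an aggressive normal form so that spelling variants, nicknames, and prefixes all collapse to the same key.

```python
# Toy match-code generator; the tables and vowel-squashing rule are invented.
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "ROB": "ROBERT"}
NOISE = {"MR", "MRS", "MS", "DR", "DOCTOR", "JR", "SR", "JD", "MD"}

def match_code(name: str) -> str:
    tokens = [t.strip(".,").upper() for t in name.split()]
    tokens = [NICKNAMES.get(t, t) for t in tokens if t not in NOISE]
    # Crude fuzziness: keep each token's first letter, drop later vowels,
    # so minor spelling variants collapse to the same code.
    squash = lambda t: t[0] + "".join(c for c in t[1:] if c not in "AEIOU")
    return "$".join(sorted(squash(t) for t in tokens))

print(match_code("Bill Shares"))            # SHRS$WLLM
print(match_code("Doctor William Shares"))  # SHRS$WLLM -- same code
```

Store a code like that in an indexed database column and a duplicate check becomes a simple equality lookup, which is exactly why match codes matter.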

Let's preview a match code to see what this does.

I couldn't get the whole output to fit on the screen here, but I think the match codes seen in the name
and the address will get my point across. Here you can see that match codes ignore minor spelling
differences and take abbreviations, nicknames, etc. into account. Why is this so significant? We now have
an easy way to find duplicates! Match codes could be stored in a database and allow quick checks for
duplicates! Let's move on to see more. I'm now going to use Clustering to see how duplicate
identification can be done. First, I'll set the clustering rules in the Properties window (note that I use
the match code instead of the actual value for the rule):

And let's preview:

Note that the cluster numbers are the same for records that match, based on the clustering conditions
I set a moment ago. Pay special attention to the fact that our Bill & William Shares didn't match. Why?
Because of the clustering conditions I set. We could modify our Quality Knowledge Base (QKB) to
indicate that SLO = San Luis Obispo, or I could remove the City as a clustering condition, together with
lowering the sensitivity on the address match code (sensitivities range from 50 to 95), and the two would
match. Let's do this to be sure:

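For the curious, here's roughly what clustering on match codes amounts to in code; the field names are hypothetical, and real sensitivity handling is much subtler than plain equality:

```python
from collections import defaultdict
from itertools import count

def cluster(records, key_fields):
    """Give records with identical match-code tuples the same cluster id."""
    counter = count(1)
    ids = defaultdict(lambda: next(counter))
    return [(ids[tuple(r[f] for f in key_fields)], r) for r in records]

rows = [
    {"name_mc": "SHRS$WLLM", "city": "SLO"},
    {"name_mc": "SHRS$WLLM", "city": "SAN LUIS OBISPO"},
]
print(cluster(rows, ["name_mc", "city"]))  # cluster ids 1 and 2: no match
print(cluster(rows, ["name_mc"]))          # both cluster id 1: a match
```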
There are a lot of really neat things that DataFlux can do. I'll try to post a thing or two out here now
and again if I see that anyone is interested.
