Research Paper SQL


Contents

I Intro
   1. What is SQL
   2. What is ETL?
   3. SQL vs Excel
   4. Writing basic code
   5. Creating the first query
   6. Scheduling a JOB
   7. Types of elements
   8. CONCATENATING
   9. RENAMING COLUMNS
   10. PRACTICE
II Where, order by, tables
   1. Tables in DW
   2. WHERE clause
   3. Comparison operators
   4. ORDER BY
   5. USING COMMENTS
   6. Practice
III Partitions, aggregate functions, group by, DATES
   1. PARTITION USAGE
   2. DATES
   3. AGGREGATE QUERIES
   4. PRACTICE
IV Joining tables
   1. JOINING TABLES
V Subqueries, wildcards, segments
   1. SUBQUERIES
   2. SEGMENTS
   3. WILDCARDS
   4. PRACTICE
VI Publishers, Decode, Case when, Lower, Upper, Partition by, Coalesce
   1. PUBLISHERS
   2. DECODE
   3. CASE WHEN
   4. LOWER AND UPPER
   5. PARTITION BY and MEDIAN function
   6. COALESCE
VII Tips and tricks
   1. HOW TO GET FROM A TO B WHEN WRITING A QUERY
   2. WITH CLAUSE
   3. ADDING AN INTERMEDIARY TABLE TO BRIDGE TO OTHER TABLES
   4. USING SUBQUERIES FOR MULTIPLE CALCULATIONS

I Intro
1. What is SQL
SQL (short for Structured Query Language) is a database management programming language.
Using SQL you can:

a. Build tables to store data
b. Build databases
c. Query (interrogate) databases and tables – which is what this training will focus on:
   data extraction. Think of a SQL table as a larger version of an Excel file (thousands of times
   larger!). Whereas Excel has a row limit of just over 1M rows, a SQL table can surpass even
   100 billion rows, so imagine the diversity of data you can find there.

For the moment, the key takeaway is that this training will teach you how to extract the data
you need for reports, analyses and project scoping – and where to extract it from.

2. What is ETL?
ETL is the integrated platform Amazon uses for SQL. It is more user friendly than SQL Developer
or other programs used to interact with databases and it is customized (and customizable!) for
the needs of Amazonians.
ETL is short for Extract, Transform, Load.
a. Extract is a pretty intuitive term, and it refers to the data extraction. It is the part of ETL
which we will use in the following weeks, and the only one this training will focus on.
b. Transform & Load – this part refers to the table creation side of ETL. Transform is the query
you write, which gathers whatever data you select and then loads into a table you can
access.

3. SQL vs Excel
In many cases SQL is just like Excel. Throughout this training you will see how different
commands have an exact Excel counterpart and do, for the most part, the same thing. As long as
you have a general idea what basic commands do, it won’t be hard to catch on with relative ease.

4. Writing basic code

Imagine you have a folder with multiple Excel files (tables). Each of these files would store
different information, organized in columns. Let’s imagine you have an Excel file named “Amazon
FCs” (FC is short for Fulfillment Center) which has multiple columns like FC name, FC location,
FC capacity etc. What do you do if you want to get that data and put it in a new file, meaning
you want to create a new output containing data relevant to you?
If you want an Excel sheet or file with just the FC id and the FC name, you go into the file called
“Amazon FCs”, you would SELECT the columns which have the information you need, let’s call
them “Warehouse_id”, and “Name”, FROM the file which holds those columns (“Amazon FCs”)
and paste them in the output you need.

And this is your first SQL query:

SELECT
    WAREHOUSE_ID,
    NAME
FROM D_WAREHOUSES

SELECT & FROM are the two basic requirements of any SQL query. It’s pretty logical that if you
want any data at all, you need to tell the SQL what you want and from which location to get it,
just as you would a real person.

5. Creating the first query


We have already covered the basic requirements in point 4 above; now we will see how to
actually write a query in ETL and get results.

a. Go to https://datanet.amazon.com, tab ETL Manager and click on Data Feed from the menu on
the left (if you don’t have permissions please see this link).

b. Once you click Data Feed, you will get to the Edit Data Feed Profile - New Profile screen, which is
basically an editor and a worksheet for your SQL code.
Remember to always name your query in a manner that will make it easy to find. This may not
seem like a big deal now, but when you reach 90 – 100 queries under your login spanning more
than half a year, it will be difficult to remember what each query does, and reading through them
will prove time consuming.

c. Write in your code, which in this case will be SELECT WAREHOUSE_ID, NAME FROM
D_WAREHOUSES. Remember the two requirements of any query, the SELECT and the FROM
keywords.
d. Click save on the bottom right corner.
We’ll take a little break from this to explain what you are seeing now, so we’ll step away from
SQL and head into ETL territory (again, ETL is the UI that Amazon has built for SQL).

The Profile section includes the basics of the ETL job – who created it and who modified it last,
and even a history of every change that’s been made (with the option of reverting back to an old
version). It also includes the Profile SQL section, at the bottom, which is where you write or
paste in your SQL. But most importantly, it has the Edit Profile button. Every time you need to
change/edit your code you’ll have to press Edit Profile.

The Publisher section is where you indicate what publisher type you want to use (if any), the
format, and the details of where to publish, or to whom to publish. A publisher will allow you to
have the data extracted from DW by your query sent to you via email, published on a shared drive
or a SharePoint etc. The publisher is very useful when you rely on a daily/weekly output from
a query which could also be quite big (>30MB). Using a publisher can save you lots of time; we
will handle the specifics in a later part of the training.

The Job section is where you indicate the time zone for the job, the priority of the job, what
database to query, what user to use to log into that database, and is where you can define some
wildcards and establish scheduling (which we’ll talk about in future classes).

No profile can run without a job, so you’ll have to create one for every one of your queries. There
is a bonus here: you can add multiple Jobs to the same Profile and, using some other ETL tricks,
bring more functionality to your data extraction.

Going back to your ETL page, you’ll notice you don’t have a JOB for your Profile, so let’s create
that now. In the Job section of the page, click on ‘NEW’.

This will take you to a new page which will require you to fill in some information, so we’ll go
through it one item at a time. Highlighted below you’ll find the fields you must fill in, along with
info on what to choose. If you omit or forget to update any of them, don’t worry, Datanet will
highlight the missing information.

a. Description – This is already filled in, but you can change it to something which has more
meaning to you.
b. Time zone – I always use the time zone closest to ours, which will be Europe/Paris. You can
choose a time zone from the drop down list.
c. Group – This part is about who can download the results of your query and who can edit it.
Default is personal, which means that only you can view the results and make changes. I
normally use a group from the drop down list, as I think it’s easier to handle changes, particularly
when I want my code to be vetted/used by someone else. The alternative (in case the group is
set to personal) would be to copy the profile, which is a bit more work and may have other
implications; we’ll review them later on.

d. Priority – believe it or not, you won’t be the only person running a SQL query at any given
moment. There are always tens of thousands (at least!) of queries queued up to run. Define
the priority in relation to the business needs of your query.
e. Partition type – we will talk more about this when we cover the Wildcards. For now just know
that you have to choose a partition (I usually go for Region and choose the value ‘2 – EU’) but
it will not in any way influence your Profile.
f. Extract Specific Options – Always use the L-DW database. The users which should grant you most
access are APPS_RO, OPS_RO, RETAIL_RO. And for Extract type, when you are querying for data
always use Data Feed.

Notifications – this is not mandatory, but, if you don’t want to check your query every 5 minutes to
see if it has run, you can have ETL send you a notification. The process is pretty straightforward and
self-explanatory. You can also explore the ‘Create New Notification’ button, which will add a lot of
functionality that, to me, is only suited for high impact queries (Business Critical or Omega Critical
priority level).

6. Scheduling a JOB

I previously briefly mentioned that you can schedule a job to run whenever you want. Normally,
you’d have to start it ad-hoc, but what if you need a new set of data every day/week/month? Then
scheduling is the way to go.

You can schedule the job to run daily, multiple times a week at your choosing, monthly etc. Just
select/tick the boxes which match your planning and you’re done.

Click save in the bottom left corner, and datanet will return you to the previous page where you’ll
see the Profile, the Job and the profile SQL.

You are now ready to run your first query. Just click the Run button in the Profile Job section and
that will set your query running. The query will go through 5 phases which I will describe below.

So you have clicked ‘Run’ and now you are requested to select a run date for your query.
Remember, DW has a 1 day lag (the information for today will only be available in DW tomorrow),
so you will never have live data. What you will have, with every query, is a snapshot which is at
least 1 day old. This means that, if you choose the current day as the run date, your query will only
run once the DW tables have loaded. All information which is live now will be stored in DW
tomorrow. The point is: always choose a run date at least one day in the past.

Now that the query has a run date, it is ready to start assessing your code, the dependencies, the
costs etc and eventually run.

New – The job run is assessing your code, looking at the table names you told it to fetch the
information from, column names etc. If there is a bug in your code, the query will not pass into the
next stages.

Waiting for Dependencies – The job run is looking at the table(s) it needs and connecting to them. If
you are stuck in this stage for more than 5 minutes, it means that one of the tables you need has
not yet loaded for the run date you have chosen, in which case you can either wait for it to load or
delete the Job run and choose an earlier run date.

Waiting for Resources – Your code has been assessed, the dependencies as well and everything
checks out. Right now your job is queued up for running. Whether it runs sooner or later depends on
the size of the DW queue and the priority you have set for your query.

Executing – DW has started fetching the data you requested via your SQL code.

Success/Error – the job run has finished. If it’s successful you can proceed to download the
results, if not…we’ll cover that later.

What else can you have in your SELECT clause?

7. Types of elements
There are multiple types of elements you can have in your SELECT clause. Besides the columns of a
table, you can select:

 Literal values, such as numbers (10) or text strings (‘Whatever’), that return exactly what you enter
 Expressions (aka formulas), such as REGION_ID * 2, which do math or other logical procedures
 Function calls, such as TO_CHAR(REGION_ID), that transform column information
 Pseudo columns, such as ROWID or ROWNUM

Your extract would look something like this. At this point you’re probably wondering about the necessity
of column 'THE WAREHOUSE IS CALLED'.

8. CONCATENATING
In SQL, just as in Excel, you can concatenate 2 or more values/ columns together. If in Excel you’d use ‘&’
between values or the =CONCATENATE() formula, in SQL you have two bars to mark a concatenation:
‘||’.

'THE WAREHOUSE IS CALLED ' || NAME
THE WAREHOUSE IS CALLED Amazon 倉庫
THE WAREHOUSE IS CALLED 強靱さが
THE WAREHOUSE IS CALLED Mopure Foods - Swanton
THE WAREHOUSE IS CALLED Paragon - Nevada
THE WAREHOUSE IS CALLED Jifram Extrusions, Inc
THE WAREHOUSE IS CALLED Ferguson Ent - Frostproof #761
THE WAREHOUSE IS CALLED XrSize - Shannon MS
THE WAREHOUSE IS CALLED WGF (Crystalcreek) - WA
THE WAREHOUSE IS CALLED [INACTIVE] SAI - Sedalia
Notice that there is a space between the end of the text string and the closing quote. Just as in Excel
you have to add spaces when concatenating (e.g. =CONCATENATE("THE WAREHOUSE IS CALLED",
" ", A2), where A2 holds the name), the same must be done in SQL if you want spaces in your
concatenated value.
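As a quick sketch, the concatenation above with its trailing space:

SELECT
    'THE WAREHOUSE IS CALLED ' || NAME  -- note the space before the closing quote
FROM D_WAREHOUSES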

That was a simple example, so let me give you a better one of how concatenating 2 or more fields can
be really helpful in a report.

Now I have the GL name and code information all in one field.
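That screenshot isn’t reproduced here; a sketch of the idea, with GL_PRODUCT_GROUP (the code) and
GL_PRODUCT_GROUP_DESC (the name) as assumed column names:

SELECT
    GL_PRODUCT_GROUP || ' - ' || GL_PRODUCT_GROUP_DESC  -- code and name in one field
FROM D_MP_ASINS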

Here’s another example of how concatenating different fields is useful.

REGION_ID || '-' ||
MARKETPLACE_ID
1-1
2-3
2-4
2-5
2-35691
2-44551
It’s a neat trick that I can concatenate any fields (literally any fields: columns, text strings, pseudo
columns etc) but from the looks of it, the more fields I concatenate, the longer the column name is going
to be.

9. RENAMING COLUMNS
SQL allows you to change the name of a column in your extract (giving it an “alias”). This means you
are not limited to column names such as MARKETPLACE_ID or ITEM_NAME. This is a good opportunity to
review the elements you can have in a SELECT clause.

There is more than one way to rename a column. The most intuitive way is to add “as” after the
column name, followed by the column alias. But, as you can see in the case of marketplace_id, “as”
is not mandatory; you can just write the column alias after the current column name. You can even
give your column a more complex name, formed from more than one word, as I did for the
gl_product_group column, but remember, you must always enclose the alias in double quotes!!!

My personal favorite is the example from the last row in the select clause, which is to separate the
words in your column alias using the underscore (‘_’).
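The screenshot isn’t reproduced here, but a sketch of the aliasing styles it describes (the aliases
themselves are illustrative):

SELECT
    ASIN as PRODUCT_CODE,            -- alias introduced with "as"
    MARKETPLACE_ID COUNTRY,          -- "as" omitted
    GL_PRODUCT_GROUP "PRODUCT LINE", -- multi-word alias must sit in double quotes
    REGION_ID REGION_CODE            -- words separated by an underscore
FROM D_MP_ASINS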

This is how your export will look. While these are not the most intuitive examples of why it is good to use
a column alias, the alias’ relevance will become obvious when we get to more complex code; but
let me give you an example first.

This is called a case when function; it is the equivalent of an ‘=IF’ formula in Excel, and yes, we will cover
it later on in our training. Notice that this ‘case when’ function is renamed ‘EXCLUSION_REASON’, and
all 12 rows of that function represent one column. Without this column alias, the column name would be
the exact piece of code, all 12 rows of it, which is not only unusable, but tells a person who is not familiar
with this query nothing about the data it pulls in this column.
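The 12-row function itself isn’t shown here; a much shorter, hypothetical sketch of the same pattern:

SELECT
    ASIN,
    CASE
        WHEN MARKETPLACE_ID = 44551 THEN 'Not released in ES'  -- hypothetical reasons
        WHEN MARKETPLACE_ID = 35691 THEN 'Not released in IT'
        ELSE 'No exclusion'
    END AS EXCLUSION_REASON  -- without this alias, the whole CASE block becomes the column name
FROM D_MP_ASINS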

So take this as advice: try to have the end client in mind when you build your queries, and make the data
as easy to understand as possible. It will save both them and you a lot of time!

10. PRACTICE
1. Build a query which pulls a list of all organizational unit ids in Amazon. The table you need to
   query is o_warehouses. Run the job and see the results.

What you see in your results is that you get the same values over and over. Why is that?

When you’re asking DW to fetch the organizational unit ids from the o_warehouses table (or fetch any
information from any table for that matter) it’s going to give you every instance of the organizational unit
id in that table.

Use ‘Distinct’ to make sure your query returns only unique rows. Consider ‘Distinct’ the ‘Remove
duplicates’ you have in Excel. You place it after ‘SELECT’ and before the first column name.
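A sketch of the fix (ORG_UNIT_ID is an assumed column name in o_warehouses):

SELECT DISTINCT
    ORG_UNIT_ID  -- one row per unique organizational unit id
FROM O_WAREHOUSES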

2. Add WAREHOUSE_ID, NAME, CAN_SHIP_INTERNALLY + a calculation column, a blank column and
   a literal expression to your SELECT clause.
3. Concatenate 3 of your SELECT clause elements to form a new column. Choose relevant fields so
that your new column contains relevant information.
4. Rename the column however you see fit, and rename 2 other columns, use a different method
each time.
5. Make the query you just built run every Sunday, Tuesday and Thursday and notify you on
success/error, with a different message for each outcome.

II Where, order by, tables

1. Tables in DW
Now you know the basics of any ETL query, the SELECT and the FROM statements. But the tables used in
the examples in Chapter 1 are not the only ones in DW. How do you know which table(s) to interrogate
in order to get the information you need? Easy: you go to BI metadata, where you can search for the
columns and tables.

Let’s do a short overview of what you’ll find there. There are 5 tabs in BI Metadata (BI stands for Business
Intelligence).

Home is the overview/welcoming page. You’ll notice a search bar at the top where you can search for
everything from columns and tables to BI questions.

Business questions – picture this as an FAQ section where users ask questions regarding the information
in DW tables and other users respond. When you’re stuck on a query, or you’re having problems
understanding why a table is structured the way it is, or you’re not sure what information a table
holds, this is a good place to search for answers.

Tables

This is an overview of all DW tables. There are some things worth pointing out here:

a. Tab ‘Table Owner’. There are 2 kinds of tables – Booker tables (the ones you can use almost all
the time with just the standard permissions) and Virt tables.

- Booker tables are for the most part public. These are tables managed by the BI team; they
usually have one of 3 prefixes – D_, O_ or A_. D_ tables are the ones you want to use; D stands
for denormalized, which means that the table is compiled from multiple other tables. O_ stands
for original tables. A_ tables are the ones you want to avoid, as they’re generally built for
reporting purposes and may not contain full sets of data.
- Virt tables are private tables. These are usually created to suit the needs of teams in terms of
reporting, or to ease the strain on queries which are too heavy to run (we’ll cover this in the tips
and tricks section towards the end of this training). A good example of data going into a VIRT is
Vendor Central data.

The takeaway is that you should try to use D and O tables as often as possible.

b. Tab ‘Description’ – here you’ll find (at times) a description of the table: who owns it, what you
should expect to find.
c. Tab ‘Partition Keys’ – shows information about how to optimize your code (use of partitions,
which we’ll cover in detail in Session 3).

d. Tab ‘Retention Policy’ tells you how long information is stored in that table. Think of this as a
lifecycle for data. If a table’s retention policy is 15 months, you will not find any records older
than 15 months in that table; basically, every day, rows of data are deleted because they have
‘expired’.
e. Filter search box – at the top right – the most important tool, as it is very efficient especially
when browsing columns. You can filter down to a keyword which matches the table name.
Unfortunately, advanced filters do not work in BI metadata; searches here return literal matches
(e.g. if I search for ASIN I will find all tables with ASIN in the name, and if I search for
ASIN_MARKETPLACE I will find all tables which contain this exact syntax in the name).

My instant data – this is the holy grail of BI. That’s because you can get a 1 row instant sample of the
information in that table using parameters specified by you. This becomes even more useful when you
have to do ASIN level deep dives.

Just search for a table, BI will autocomplete the table name.

Now fill in the required fields and voila, you’ll have all the information that table has for that row.

Search – this is an advanced search for tables/columns.

Using the Advanced Search field

This advanced search will save you a lot of time and effort. From the very start you can eliminate Users’
tables (virt) from your results – which we already know we can’t query without special permissions – and
you can choose whether your input in the search box should apply to Table/Column names only or also
to descriptions etc.

Table overview

Ok, so now we know how to search for the tables and columns we need. Let’s go and explore one table
at a time and see what’s there. To enter one table just search for one as you just learned above and click
on it. That will take you to a page just like the one below. For this example I chose a table I use frequently
– D_MP_ASINS

You’ll see some common information such as the Description, the Partition Information. What’s cool is
that you have a special section called Associated Business Questions. I urge you to read that part (or
browse through it) before using that table, you may find very useful tips there.

Equally useful is the ‘Show Columns and Sample Data’ button. Click on it and it will take you to a view of all the
columns, where you can see the data type (e.g. NUMBER, CHAR etc), a description, but the real treat is
the Sample Data button at the bottom of the page.

That’s going to show you live output (unless the content is blocked for security reasons) and you can
better understand how to use every particular column.

Types of data

When querying DW you will most commonly find 4 types of data. Later on, when we talk about dates
and using multiple DW tables in a single query, the importance of these data types will become even
clearer.

CHAR (followed by a number – CHAR(10)) – this data type is of fixed length. It is used in DW for fields
such as ASIN or EAN, where the number of characters will always be the same. This data type uses both
letters and numbers

VARCHAR – it is the same as CHAR data type, only that it has no fixed number of characters. This can be
used for almost any data, from customer reviews to customer_ids.

NUMBER – unlike CHAR and VARCHAR, which use the string format and can hold alpha-numeric values,
NUMBER will only have…well…numbers. You can do calculations on this field, such as
additions/subtractions/averages etc.

DATE/DATE TIME – this data type is used solely for time stamps. Again, it will become more obvious how
this is helpful once we go further into this training.

Exercise
Now that you know all these, find a Booker table from which you can extract the ASIN, the
MARKETPLACE, the REGION and the ITEM NAME. Specify the name of the table you find, the ~ number of
rows it has, and what the table prefix means.

Now build the query to include the above elements, and make sure to only get unique rows, and before
you run it, we’re going to talk about the explain plan.

So, you’re done building your query, you have built a profile for it, and you are ready to run your profile.
What if you have a bug in your query? This might not make sense now, but as your queries get more and
more complex, it will be pretty easy to miss a comma or a keyword, misspell a column name etc. If you
click run on a broken piece of code, it will error out within minutes.

Let’s validate the code. Click the Explain and Analyze Query button, and after it, choose the option
Explain and Analyze.

For the sake of this exercise you’ll run the explain plan after you have saved the query. You can also
run the explain plan while you are editing the SQL code, which is the best option since you can make
changes and correct whatever is wrong without having to save and reopen the SQL Profile code.

Generally, once the Explain Plan has run, this is what you want to see:

‘This statement conforms to all rules’ is the Holy Grail: it means you can run your code as it is. When you
run an explain plan, there are 3 fields you want to keep an eye on (highlighted above), which are:

- Rows – it will give you a rough estimate of how many rows your query will scan to retrieve the
information you need.
- Bytes – shows you the temp space your code is using.
- Cost – an estimate of the strain your code will put on Datanet. Generally I have found that it is linked
to ‘Rows’ and ‘Bytes’ and that it’s best to keep it under 1,000,000.

Please remember that the above indicators are on the spot rough estimates of the cost your query will
have and should be treated as such!

So you’ve run your explain plan and your explain plan is not the same as the one above. It actually looks
more like this

‘The Query Analyzer found issues with your query/queries.’ is a clear indicator that something’s wrong.
The number of Rows, the Bytes allocation and the Cost have skyrocketed! Plus you also see that this
current code will return > 10 million rows. As it is, even if it does run (which it probably will, since it’s a
pretty simple query), you couldn’t use any of the results.

How can you limit the results?

2. WHERE clause
The WHERE clause is the Excel filter of SQL. If I were to ask you to find me a “red pen” somewhere in the
office area, you’d spend a lot of time looking for pens. By SQL logic you’d return with all the red pens on
the floor, and I’d have to sort through the pens to find the one I wanted. But if I tell you to get me the red
pen(s) on your colleague’s desk, that would save us both a lot of trouble. WHERE works much the same
way.

So, what can we add to the WHERE clause to reduce the results? The options are rather broad, but most
times you will be limiting your results to specific column values.

Let’s start with the query above, for which we saw the huge costs. Let’s limit the results to the EU region
(2) and the ES marketplace (44551). We can list any number of restricting conditions.

The WHERE is placed after the SELECT and FROM statements!!!
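As a sketch (assuming the catalog table from these examples is D_MP_ASINS):

SELECT DISTINCT
    ASIN,
    MARKETPLACE_ID,
    REGION_ID
FROM D_MP_ASINS
WHERE REGION_ID = 2          -- EU
  AND MARKETPLACE_ID = 44551 -- ES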

We still see that the rows returned are > 10 million (it’s understandable, you’re asking for all the ASINs –
retail or 3P – which are in the Amazon Catalog in ES), but the costs (rows scanned, bytes used) are a lot
lower. So adding these 2 filters already improved the query.

Remember!!! The multiple conditions you use in your WHERE clause have to be separated by ‘AND’, as
opposed to the items in your SELECT clause, which are separated by a comma – ‘,’!!!

Think of AND in SQL as the ‘=AND(statement1, statement2)’ of Excel. You can also use OR, and it too
works exactly like the OR in Excel.

This example is not the best, but it’s just here to illustrate a point. What it says is: if either the Region
is EU OR the Marketplace is ES, then get the information requested in the SELECT clause. It is an
example of a poor filter because it will pull every record in the EU region, regardless of the
Marketplace.

And, just as in Excel, you can combine the 2. Let’s say you want to pull the records in Region 2 (EU),
when the marketplace is either IT or ES.

You’ll be tempted to write it this way, but let’s think this through. Your Excel formula would look
something like this: =AND(REGION_ID=2,OR(MARKETPLACE_ID=44551,MARKETPLACE_ID=35691)).
See how the OR part is ‘nested’ inside the AND part of the formula?

To achieve this in SQL, you need to ‘nest’ the OR statement, just like in Excel. The WHERE clause will look
like this:
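A sketch of that clause (the original was shown as a screenshot):

WHERE REGION_ID = 2
  AND (MARKETPLACE_ID = 44551 OR MARKETPLACE_ID = 35691)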

What this does is combine 2 conditions in your WHERE clause: the region must be EU, and the
marketplace must be either ES or IT.
We’ll talk later on about the importance of ‘nesting’ statements.

3. Comparison operators
1. ‘=’

In the examples above I used ‘=’. This can be used for both numeric values (e.g. marketplace_ID) or
char/varchar2 values (e.g. ASIN). The difference is that char/varchar2 values will always be used inside
single quotes!

In Excel you can build formulas to apply to certain values you specify – as detailed above – but you
can also apply a formula to every value except the ones you specifically label:
=AND(REGION_ID = 2, MARKETPLACE_ID <> 44551). You can also use ‘<>’ in SQL, or you have
another option, ‘!=’.

Write a WHERE clause in which the region_id is 2 and the marketplace_id is not 35691 (IT) – once
using ‘<>’ and once using ‘!=’.

2. However, for char or varchar2 formats (basically for string formats) it’s better to use LIKE. Again,
   for non-numeric values remember to use single quotes, otherwise you’ll get the ‘INVALID
   IDENTIFIER’ error.

I used the code without single quotes, which gave me the above error. On the right side is the
correct code.

What LIKE has over ‘=’ is the ability to match parts of a value as opposed to the entire value. This can be
achieved using ‘pattern matching characters’: the ‘%’ and the ‘_’.

‘%’ replaces any number of characters. For example, if I wanted to find all ASINs starting in B00 and
ending in 4A, I would use LIKE 'B00%4A'.

‘_’ replaces only one character, and just as with ‘%’ you can place it anywhere in the string you are
trying to match.

What would the query with the current filter pull?

For both ‘%’ and ‘_’, you can use them multiple times in the same string, e.g. ‘B00_0000__’.
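A sketch putting both pattern characters to work:

SELECT
    ASIN
FROM D_MP_ASINS
WHERE ASIN LIKE 'B00%4A'      -- starts with B00, ends in 4A, any number of characters between
   OR ASIN LIKE 'B00_0000__'  -- exactly one character after B00, then 0000, then exactly two more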

3. IN operator

What if you want to query for all the ASINs which are in region 2, in the marketplaces IT, ES, FR? Can you
do that using ‘OR’? Yes, but it’s a bit like squashing water. Instead you can use the IN operator and the
benefits will become more obvious once you query for ASIN/subcategory/GL lists.

Below is the same thing but using OR instead of IN.

It’s important to point out that you’ll still need to use single quotes with the IN operator if you query for
a char/varchar2 item list (e.g. ASINs, EANs), like below.

You can also query for records which are not in a list by putting NOT ahead of IN.
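A sketch of IN and NOT IN (the ASINs are the ones used in the practice section):

SELECT
    ASIN,
    MARKETPLACE_ID
FROM D_MP_ASINS
WHERE REGION_ID = 2
  AND MARKETPLACE_ID IN (5, 35691, 44551)        -- FR, IT, ES
  AND ASIN NOT IN ('B00TFGWAA8', 'B00UNA1O0W')   -- char values keep their single quotes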

4. ‘<’, ‘>’ operators

These operators are to be used just like ‘=’. They can be applied to numeric fields (like marketplace_id)
but also to char/varchar2 fields (ASIN) where ‘>’ will mean the next value in alphabetical order will get
picked up.

Which marketplaces will get picked up by this code?

What if you also want to include Marketplace 4 in the results?

You can also use >= or <=

5. BETWEEN operator

BETWEEN does basically the same job as >= and <=, but it has the added value that you don’t need to
write the column name twice. It will also become a lot more obvious why this is an important operator
later on, when you will work with dates.

You use the BETWEEN operator as described in the picture on


the left, meaning you place it after the field you want this
operator to apply to.

After BETWEEN you have to list the marginal values of your


interval, delimited by AND. In this case I want to see all the
ASINs which have marketplace_id = 2 (CA) or = 3 (UK) or = 4
(DE) or = 5 (FR).
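That clause, as a runnable sketch:

SELECT
    ASIN,
    MARKETPLACE_ID
FROM D_MP_ASINS
WHERE MARKETPLACE_ID BETWEEN 2 AND 5  -- 2 (CA), 3 (UK), 4 (DE), 5 (FR)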

4. ORDER BY
ORDER BY is to SQL what SORT is to Excel. From a syntax standpoint, ORDER BY is placed after
WHERE!!!

The results will be sorted from smallest to largest (for NUMBER) or from A to Z (CHAR/VARCHAR2)
by default.

If you want the results to be ordered from Z to A just add DESC after the criteria for ordering.

You know how in Excel you have the advanced sort, where you give more than one criteria for sorting
your data? Of course you also have this here, only it’s easier to use in Excel 

The columns/information you choose to sort your data by have to be delimited by a comma, just like
in the SELECT statement.

You’ll note that I wrote ASC after MARKETPLACE_ID. I could have left ASC out and the result would
have been the same, since ascending is the default.
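A sketch of such a two-criteria sort:

SELECT
    ASIN,
    MARKETPLACE_ID,
    REGION_ID
FROM D_MP_ASINS
WHERE REGION_ID = 2
ORDER BY
    MARKETPLACE_ID ASC,  -- ASC is the default and could be omitted
    ASIN DESC            -- Z to A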

Later on, when we’ll talk about the DECODE statement, you’ll see how to sort by custom values you
choose.

5. USING COMMENTS
Right now you know how to write basic code. It is pretty easy to read the code you have so far; I
don’t think you can be in doubt about what putting MARKETPLACE_ID, ASIN and REGION_ID in the select
might mean. But, I stress this again: your queries will grow in complexity. If you review a complex
query 2 or 3 months after you wrote it, it will take you some time to figure out what everything
does, what the logic is, why it is written the way it is. This is what comments are for: you can make
comments throughout your query to call out the purpose of different bits of information.

The comments are not considered by SQL when running your code; they’re there only to help you.

There are 2 ways of commenting in SQL.

a. Adding “--” in front of your comment. SQL will disregard all information after “--” as long as it is
   on the same row – check the comment in the WHERE clause after the Marketplace_id filter.
b. Enclosing the part you want to comment in /* and */. This is useful when you have a bigger
   portion of code to comment – check the last 3 rows of the query.

But there is another use for the comments. Look at the SELECT clause, what do you notice?

I have used the comments to eliminate the gl_product_group column from my select clause, but I did not
delete it. I did this because I believed it was important for me to know that I also used to pull the
gl_product_group column.

Basically, using comments can also act as a method to preserve previous iterations of my code.
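A sketch showing both comment styles, plus a commented-out column:

SELECT
    ASIN,
    MARKETPLACE_ID, -- 44551 = ES
    --GL_PRODUCT_GROUP,  -- no longer pulled, kept for reference
    REGION_ID
FROM D_MP_ASINS
WHERE REGION_ID = 2
/* everything between these markers is ignored,
   handy for commenting out bigger portions of code */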

6. Practice
1. Find a table which has the ASIN, subcategory information (both subcategory name and
   subcategory code), the marketplace, the region, and the GL code. Make sure it’s a BOOKER table.
2. Write a SQL query to extract the fields above only for Germany (in the Marketplace_id
   column description you’ll find a wiki with all the MARKETPLACE_IDs and what country they
   represent).
3. Only pull unique rows.
4. Limit the results to GLs 60, 75 and 79, and write this filter in your query in 2 different ways, using
different relational operators, and in one of the cases use ‘OR’.
5. Rename 3 of the columns.
6. Add a comment to the GL filter in the WHERE clause and mention the GL name for every
GL_code you used (useful wiki with all GL codes and names here).

7. Order the results by GL and MARKETPLACE from smallest to largest and Subcategory code from
largest to smallest.
8. Validate the SQL plan.
9. Run your query to pull results only for the following ASINs: B00TFGWAA8,
   B00UNA1O0W, B00TFLQ2V6, B00ZPEAFXI.
10. Run an explain plan and run your query.
11. Explain the blank rows outcome.

Useful information

Wiki link of all Marketplace_id values - here

III Partitions, aggregate functions, group by, DATES

Recap exercises

1. What are the 4 data types you can find in a DW table?
2. Give a column example for each data type.
3. What are the 4 clauses we have learned so far?
4. Put those clauses in order.
5. What do you need in order to run an already written query?
6. Name the differences between a query with the Job Profile Group set to personal and one with
   the Job Profile Group set to a group.
7. What is the difference between CHAR and VARCHAR2?
8. How many types of table owners are there? Name them.
9. What does ‘my instant data’ do?
10. Name all the comparison operators.
11. Use 3 comparison operators in an example.
12. Use the remaining 2 in an example.
13. Which are the 5 stages of running a query?
14. Give examples of 2 types of elements and write them down.
15. Concatenate any 2 columns you have used in this training so far, separate them by a space.
16. How do I order a query’s output from smallest to largest and largest to smallest?
17. How can you insert a comment in a query?
18. Use “Advanced search” to only see Booker tables.
19. What is the retention policy of a table?
20. Check the sample output of a DW table.

1. PARTITION USAGE
We talked about filters and how to use them, and how to use relational operators to limit the results. Your
queries will reach a level of complexity where you’ll want them written as efficiently as possible. Limiting
the results only to the field values you need may not be enough; you will need to take partitions into
consideration.

Range Partitioning makes queries against large tables much more efficient. A table that is range
partitioned by one or more columns is virtually (and even physically, via disks) split into chunks called
partitions – one for each distinct value or range of values in that table. For example, if the table
D_MP_ASINS were partitioned by the column REGION_ID, it would be virtually split into one partition
per region – one that includes all the records for REGION 1, one for REGION 2, one for REGION 3, etc. A
query against the table with REGION_ID = 3 in its WHERE clause would look only at the portion of records
associated with REGION 3, and not bother scanning the other partition chunks – thus making it several
times more efficient. If a table is not partitioned, each row of the table is checked against the conditions
in the WHERE clause, one by one. If a table is partitioned, only the rows in the partitions that match the
WHERE clause conditions on the partitioned columns are scanned, thus saving time and resources.

Think of the DW table as a normal Excel table. Now imagine that for every partition you see in a DW
table, you have a separate sheet in your Excel file.

Remember, you can see the partition information in BI metadata, at table details.

For now, this is enough knowledge about partitions, but we’ll talk about them more in the coming weeks.

2. DATES

While working with Data Warehouse tables, you’ll find two types of DATE columns: DATE columns that
are truncated to only the Month, Day, and Year information (e.g. 12/31/2008), and DATE columns that
also contain the Hour, Minute, and Seconds (e.g. 12/31/2008 08:13:52) – known as the DATETIME
format.

I have never found an instance where I needed DATE information down to the second; however, I have
seen tables where the only time reference column was DATETIME. DATETIME requires a special syntax,
which we’ll cover later in this chapter. So we will focus on the DATE column type. Why is the DATE data
type so important, and what exactly can you use it for?

A timeframe gives meaning to any report. If I want to analyze the sales product ‘X’ had during peak
season (e.g. 2 weeks before Christmas) relative to the rest of the year, I will need timeframe
information for the records I pull from DW.

How do you identify DATE type columns in DW? The same way you check column information and table
contents, you use BI-metadata information.

And in sample data you can see exactly what the column contents look like.

In the case of Activity_day, the information stops at day level, with no further details on hour/ min/
second. That’s because the format is DATE and not DATE TIME.

What exactly can you do with the DATE besides just referencing it? We’ll turn a bit to data conversion for
these examples.

The TO_CHAR() Function with Dates

There are many ways to write a date, from the US standard of 03/31/2009 to the UK standard of
31/03/2009, writing them as March 31st, 2009, or combinations of words and numbers, like 31-MAR-09.
Some of these formats can be very precise, while others are less so. For example, if a Book was
published on 31-MAR-09, do we know if it was published in 2009 or 1909? Unfortunately, we don’t, and
programs like Excel may make assumptions that could be wrong.

When writing SQL queries, you may find you want to control the format of a date column in your results,
so you always know what format it will be in and so there is never any question of exactly what the date
means. To do this, we use the TO_CHAR() function, which converts the DATE to a character string, in a
format specified by you.

Let’s pay a little attention to the syntax used. Notice that I pulled the column “ACTIVITY_DAY” and then I
got several conversions/bits of information from it. I converted the data type from DATE to CHAR.
Whenever you convert one data type to another, always reference the new data type at the beginning of
the row. After you mention ‘TO_CHAR’ you have to enclose in parentheses the name of the column you
are converting, followed by a comma, and then, within single quotes, the format you will be converting
the data into.

See column ‘ACTIVITY_DAY’ for the format the datanet table I have queried uses to store the DATE
information. It goes day, month (as abbreviation) and then year (as the last 2 digits of the year). In this
format, we can’t really be sure what the year is, could be 2016, could be 1916 etc. Using TO_CHAR (see
the SELECT above) I have made the date easier to read (,TO_CHAR(ACTIVITY_DAY,'MM/DD-YYYY')), I have
pulled the number of the day (,TO_CHAR(ACTIVITY_DAY,'D')) and what the day actually was
(,TO_CHAR(ACTIVITY_DAY,'DAY')).
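Assembled into a runnable sketch (same table and column as the example):

SELECT
    ACTIVITY_DAY,
    TO_CHAR(ACTIVITY_DAY, 'MM/DD-YYYY') EASY_TO_READ,  -- unambiguous date
    TO_CHAR(ACTIVITY_DAY, 'D')          DAY_NUMBER,    -- number of the day in the week
    TO_CHAR(ACTIVITY_DAY, 'DAY')        DAY_NAME       -- what the day actually was
FROM D_DAILY_ASIN_ACTIVITY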

This is pretty neat but there were more elements in the above SELECT, which allow for even more
customization of the DATE.

I have the year written in letters (,TO_CHAR(ACTIVITY_DAY,'YEAR')), the year written in letters and with a
numeric reference (,TO_CHAR(ACTIVITY_DAY,'YEAR-YYYY')) and more complex construct at which we’ll
look separately.

TO_CHAR(ACTIVITY_DAY,'DAY "the" DDD"th day of" YYYY "the" CC"th CENTURY"')

What I have done here is to add bits of string format. Notice that the syntax is almost the same, start
with TO_CHAR, the format I want to pull is still enclosed in single quotes, but, the string bits I have used,
which are not recognized DATE elements (like D or YEAR or YYYY or MM or MONTH etc are) have to be
enclosed in double quotes. Sure it looks like overkill going into such a great level of detail, but further
along, a trick like this will spare you some extra Excel work processing the DW output.

The TRUNC() Function with Dates

Another type of conversion you can do to a DATE field is to truncate the date using the TRUNC()
function. TRUNC() is used much like TO_CHAR(), but instead of translating the DATE field into a character
string, it truncates the date to the level you specify, while leaving it in a DATE format.

Why is it worth mentioning that after using TRUNC we’d still end up with a DATE format? Because you
can do calculations on a DATE format, compare dates, and use comparison operators.

Below is the result from the above SELECT statement.

The syntax is the same as with TO_CHAR, so let’s go through what each TRUNC function does.

- DDD will give you the current day from which the data was taken.
- D will give you the first day of the week from which the data was taken.
- Y will give you the first day of the year.
- MM will give you the first day of the month.
- Q, the first day of the quarter.
- CC, the first day of the century.
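A sketch applying each of those formats to the same ACTIVITY_DAY column:

SELECT
    TRUNC(ACTIVITY_DAY, 'DDD') THE_DAY,       -- the day itself
    TRUNC(ACTIVITY_DAY, 'D')   WEEK_START,    -- first day of the week
    TRUNC(ACTIVITY_DAY, 'Y')   YEAR_START,    -- first day of the year
    TRUNC(ACTIVITY_DAY, 'MM')  MONTH_START,   -- first day of the month
    TRUNC(ACTIVITY_DAY, 'Q')   QUARTER_START, -- first day of the quarter
    TRUNC(ACTIVITY_DAY, 'CC')  CENTURY_START  -- first day of the century
FROM D_DAILY_ASIN_ACTIVITY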

The TO_DATE() Function


One frequent use of DATE columns, besides returning them in your results, is to use them in your WHERE
clause to limit your results. In fact, DATE columns are commonly used as partitions on tables, so this use
is very common. A function called TO_DATE() comes in handy when working with DATE columns in your
WHERE clause. It’s essentially the opposite of the TO_CHAR() function – turning a character string into a
DATE format.

I was already using TO_DATE in my WHERE clause to limit the results and the area of search for my
query.

TO_DATE has this syntax when used in the WHERE statement: TO_DATE(‘date you want to use’ (in
single quotes), comma, ‘format for the date you just entered’). It is very important to keep an eye out
for this, because it’s easy to make a mistake. For example, if I write
ACTIVITY_DAY = TO_DATE(‘2016-02-13’,’YYYY-DD-MM’) I will get an error like “ORA-01843: not a valid
month”.

What my filter tells DW is basically to look for all records associated with the 2nd day of the 13th month;
of course it will error out.
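The corrected filter, as a sketch:

WHERE ACTIVITY_DAY = TO_DATE('2016-02-13', 'YYYY-MM-DD')  -- format string now matches the date string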

Comparison operators with DATES

a. BETWEEN

Now we’re getting back to the beginning of this chapter, where we needed to figure out what sales an
ASIN had 2 weeks before Christmas. We’re going to go with OPS (ordered product sales) – we’ll cover
economic indicators later on, when we get to the project/initiatives monetization parts.

The syntax is pretty self-explanatory: you’re giving SQL the outer limits of the time interval you want.
Keep in mind that the lower bound comes first – BETWEEN x AND y translates to >= x AND <= y.

BETWEEN comes in handy when having to query a DATE TIME column. If the date field you put in your
WHERE clause is a DATE TIME (e.g. 2016/02/17 06:29:13), using ACTIVITY_DAY =
TO_DATE(‘2016/02/13’,’yyyy/mm/dd’) will return very few or no rows. Essentially, you would be asking
DW to get all the results associated with exactly 2016/02/13 00:00:00; the probability of having any
information for a DATE TIME specified down to the second is very low, and in any case you’ll miss most
rows.

What is there to do? There are 2 options:

a. Instead of using a one day reference you can add an interval (e.g. ACTIVITY_DAY BETWEEN
   TO_DATE(‘2016/02/13’,’yyyy/mm/dd’) AND TO_DATE(‘2016/02/14’,’yyyy/mm/dd’)).
b. TRUNC the DATE TIME to just DATE level (e.g. TRUNC(ACTIVITY_DAY) =
   TO_DATE(‘2016/02/13’,’yyyy/mm/dd’) )

Both options work, but option a is the better one for large tables; it’s a lot more efficient. The reason is
that by using BETWEEN the query limits the scan to just the timeframe you mention. Using TRUNC, the
query would still run against every row of the table, which is very costly for the system.

b. ADDITIONS and SUBTRACTIONS

You can use additions or subtractions with dates on their own, or combine them with BETWEEN.

Once we get to the run_date wildcard it will become clear how additions/subtractions go hand in
hand with DATES.
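As a sketch, subtraction combined with BETWEEN (SYSDATE is Oracle’s current date):

WHERE ACTIVITY_DAY BETWEEN TRUNC(SYSDATE) - 8 AND TRUNC(SYSDATE) - 1  -- the last full week of data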

c. Greater than or less than >,<

The ‘<’ or ‘>’ is the easiest comparison operator to use when you’re looking for data from a certain
point in the past till today.

OTHER DATE FUNCTIONS

Although TO_CHAR(), TRUNC(), and TO_DATE() are probably the most commonly used DATE functions,
SQL includes several more that you may find useful. These include:

ROUND( date , format ) – used to round a date up or down to the nearest day, month, year, etc.

ADD_MONTHS( date , number of months) – used to add (or subtract) months from a date

LAST_DAY( date) – used to determine the last day of the month the date falls in

NEXT_DAY( date , weekday ) – used to find the date of the first occurrence of the specified weekday
after the date specified

MONTHS_BETWEEN( later date, earlier date) – used to determine how many months are between two
dates
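A quick sketch of these functions in action; DUAL is Oracle’s built-in one-row dummy table, handy for
testing expressions:

SELECT
    ROUND(SYSDATE, 'MM')        ROUNDED_TO_MONTH, -- nearest month start
    ADD_MONTHS(SYSDATE, -3)     THREE_MONTHS_AGO,
    LAST_DAY(SYSDATE)           MONTH_END,
    NEXT_DAY(SYSDATE, 'MONDAY') NEXT_MONDAY,
    MONTHS_BETWEEN(SYSDATE, TO_DATE('2016-01-01', 'YYYY-MM-DD')) MONTHS_SINCE
FROM DUAL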

The results look like this.

If you look at the rows highlighted in yellow, you’ll see the same row 4 times. How can I make those 4
rows into just one? Imagine you query for thousands of ASINs at a time; getting an extra 4 rows per
ASIN will make it hard for you to process the data in Excel.

We’ve already covered DISTINCT, which would do the trick here, but what if I want all the sales in a given
day for one ASIN?

3. AGGREGATE QUERIES
Aggregate means grouping multiple rows into just one row.

You have already created aggregate-like results when you used the DISTINCT keyword. By using
DISTINCT, you grouped multiple identical rows into just one, making the results easier to use.

The aggregate functions are to SQL what a pivot table is to Excel.

The main aggregate functions are

COUNT – which counts how many values there are in a column

MAX – which finds the maximum value in a column

MIN – which finds the minimum value in a column

SUM – which adds together the values in a column

AVG – which averages the values in a column

MEDIAN – which gives you the middle value from an ordered list of values – we’ll cover this later as it
takes a special syntax

Aggregate functions are placed in the SELECT part of the query. The syntax is as follows: the function
name (e.g. count) followed by the field you want the aggregate function applied to inside parenthesis
(region_id). In this case it would look like this: count(region_id)

Let’s take the first 5 aggregate functions and give them a run. Don’t mind the filters; they are there to
make this query run as fast as possible.

It’s worth mentioning that these aggregate functions will not all work with any data type. COUNT, MIN
and MAX can be used with every data type, but AVG and SUM are only for numeric values. Trying to sum
up a list of ASINs will result in an error, because ASIN is a CHAR data type.
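The screenshotted query isn’t reproduced here, so here is a sketch consistent with the results discussed
below (the filters match the ASIN, region and date mentioned there):

SELECT
    COUNT(MARKETPLACE_ID),
    MIN(MARKETPLACE_ID),
    MAX(MARKETPLACE_ID),
    SUM(MARKETPLACE_ID),
    AVG(MARKETPLACE_ID)
FROM D_DAILY_ASIN_ACTIVITY
WHERE REGION_ID = 2
  AND ASIN = 'B003Y3M4N6'
  AND ACTIVITY_DAY = TO_DATE('2015/12/11', 'yyyy/mm/dd')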

Above you can see the results. We know that the SUM of the MARKETPLACE_IDs is comprised of 26
values. This information is given to us by the COUNT. We see the MIN and MAX values, but we already
knew them since the MARKETPLACE_IDs in EU are UK – 3, DE – 4, FR – 5, IT – 35691 and ES – 44551.

You’ll notice that given all these, the MAX is not 44551, why is that?

This means that for the ASIN in the WHERE clause there is no record for the ES MARKETPLACE. That ASIN
does not exist in this table (which is not necessarily an indicator of whether the ASIN is released or not in
ES, or if it has a retail or 3P offer).

What is the name of this table? What information do you think it stores? The table name helps us out a
lot, it’s called D_DAILY_ASIN_ACTIVITY. So this table stores ACTIVITY information at DAILY level. Based on
this, can you say if there is an offer (of any kind) for that ASIN in ES? We can’t deduce that from the table
results, all we know is that for the ACTIVITY_DAY queried (11th December 2015) there was no ACTIVITY
(shipments).

And why, since there are only 5 MARKETPLACE_IDs in EU, is the COUNT 26?

These functions do not pull distinct values. They pull every instance of the conditions specified in
the WHERE clause. In this case, we know there are 26 references to the ASIN B003Y3M4N6 in
D_DAILY_ASIN_ACTIVITY in REGION 2 – EU on December 11 2015.

What if I don’t need this information, and I want these aggregate functions done only on DISTINCT
values?

DISTINCT with other AGGREGATE functions

You can use DISTINCT within a function and it will do the same thing DISTINCT does when you put it at
the beginning of the SELECT clause: it will only consider DISTINCT values.

The syntax is the same as when you use it to only get DISTINCT rows.

Let’s compare the results from the SELECT statement which did not use DISTINCT on the aggregate
functions and the results from the code shown here.
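A sketch of the ‘with DISTINCT’ variant:

SELECT
    COUNT(DISTINCT MARKETPLACE_ID),  -- unique marketplaces only
    SUM(DISTINCT MARKETPLACE_ID),
    AVG(MARKETPLACE_ID)              -- no DISTINCT here, so this value stays the same
FROM D_DAILY_ASIN_ACTIVITY
WHERE REGION_ID = 2
  AND ASIN = 'B003Y3M4N6'
  AND ACTIVITY_DAY = TO_DATE('2015/12/11', 'yyyy/mm/dd')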

Without DISTINCT

With DISTINCT

Notice that the COUNT and the SUM are different values, but the AVERAGE is the same. That’s because I
did not use DISTINCT on AVG (not because SQL restricts this). The point is that using DISTINCT in one
aggregate function will not automatically expand to the rest of your functions.

As I have said before, we can use COUNT, MIN and MAX on non-numeric data types.

I have chosen 15 ASINs for this exercise. I want them counted both as unique values and as separate
instances in the table. And I want the MAX value and the MIN value, which in this case means SQL will
sort the ASINs alphabetically ascending and get the first (MIN) and last (MAX) values in that list.

I’ll let you interpret these results.

GROUP BY
Using the SELECT above, and looking into the results, I get the general information. I know I have 15
ASINs which appear in the D_DAILY_ASIN_ACTIVITY table a combined 582 times for the date of
December 11 2015, but I don’t know which ASIN appears how many times. Plus, using the
MARKETPLACE_ID is not the best example in terms of relevance.

Let’s take the same SELECT again, once using aggregate functions and once without. We’ll turn to a
finance field once more to emphasize the relevance of aggregate functions.

This is the classic SELECT and you already know what the output looks like. You have to use an Excel pivot
table to make use of the results – a lot of work. Plus, the results are going to be huge (I stopped at the
8th row but it went on and on).

So let's try and sum up the NET_ORDERED_PRODUCT_SALES and have just one line/ASIN. Do that in ETL/SQL Developer and see what happens. It won't run, will it? You get this error: "ORA-00937: not a single-group group function". What does that mean?
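For reference, a minimal sketch of the kind of statement that raises this error (same assumed table and date as the earlier examples):

-- errors out with ORA-00937: ASIN is selected, but nothing tells SQL how to group by it
SELECT
     asin
    ,SUM(net_ordered_product_sales) AS net_ops
FROM d_daily_asin_activity
WHERE region_id = 2
  AND activity_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')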

You are basically telling SQL to sum up the OPS (ordered product sales) data, but you are not giving it a criterion to sum up by. Aggregate functions work the same as an Excel pivot. If in the values section of the pivot you have the sum of OPS and nothing else in the entire pivot, you will see a sum of all OPS without any other indicator, just like in the examples with the MARKETPLACE_IDs.

What will give meaning to your data is placing the ASIN in the ROWS section of the pivot.

Now we can use our results and present them in a comprehensible manner, but, as I have said before, getting them in this format required some extra work.

More importantly, what the pivot just did was to group the OPS data by a criterion, which is the ASIN.

This is what the GROUP BY statement does. No aggregate function can work alongside non-aggregated columns without a GROUP BY clause.

Now the calculations apply at ASIN level. We can go one level up and also get Marketplace information just by adding it in the SELECT clause.

Usually what I do is to copy the entire SELECT clause into the GROUP BY clause and delete the aggregate functions, to make sure I don't miss anything. It's a good habit.
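Putting it together, a sketch of the grouped version (same assumed table and date as above):

SELECT
     marketplace_id
    ,asin
    ,SUM(net_ordered_product_sales) AS net_ops
FROM d_daily_asin_activity
WHERE region_id = 2
  AND activity_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')
GROUP BY  -- a copy of the SELECT clause, minus the aggregate functions
     marketplace_id
    ,asin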

HAVING CLAUSE
Once you begin aggregating, you'll find that you may want to limit your results to records where the result of an aggregation meets a certain criterion. For example, we might only want to look at OPS for ASINs with sales in more than one Marketplace. We can't do this in the WHERE clause, because the conditions in the WHERE clause are evaluated before we aggregate.

In a nutshell, HAVING is to aggregate functions what WHERE is to any other field.

Syntax-wise, HAVING is placed after GROUP BY! Going back to the full list of 15 ASINs, let's assume I am interested in the underperforming products. I already know from the results above that it is possible for one product to account for over 6k€ OPS, but I know there are other ASINs which do not sell as well. I want all ASINs which for that date are responsible for less than 300 € (this threshold is not an indicator of any kind, it's a random value to demonstrate a point).

I did decide to go one level up and get Marketplace information as well. Just as in the SELECT clause, multiple elements are separated by a comma in the GROUP BY clause, and not by AND, as you would use in the WHERE clause.

This is what my table looks like, exactly like an Excel pivot table. Just as in the WHERE clause, multiple filters in the HAVING clause are separated by AND. It's worth pointing out that you don't necessarily have to have an aggregate function in the SELECT clause in order to reference it in the HAVING clause, just as you don't have to have a column in your SELECT clause in order to also add it in the WHERE clause.
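A sketch of the full statement, with the random 300 threshold from above:

SELECT
     marketplace_id
    ,asin
    ,SUM(net_ordered_product_sales) AS net_ops
FROM d_daily_asin_activity
WHERE region_id = 2
  AND activity_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')
GROUP BY
     marketplace_id
    ,asin
HAVING SUM(net_ordered_product_sales) < 300  -- filters on the aggregate, after grouping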

And of course my results are now limited and I get information for just one ASIN out of two.

4. PRACTICE
1. Find a table (other than D_DAILY_ASIN_ACTIVITY) which contains a financial indicator (e.g. OPS, revenue, contribution profit, GMS etc).
2. Find the partition details and also find the relevant DATE column. Select any 4 day interval to query.
3. Delimit the 4 day interval you want to query in 3 ways, using 3 different comparison operators.
4. Pull the current day of the week (both number and name), the week number and the first week day along the DATE column.
5. For this set of ASINs (B0088PUEPK, B003Y3M4N6, B0093RMTY6, B00FOKN7D8, B00G7LQA1E) pull the ASIN, the MARKETPLACE (or similar information the table you chose has), the day the metrics were captured and use whichever aggregate functions you find relevant.
6. Limit the results to ASINs which have information for the given time frame in at least 4 MPs.

IV Joining tables

1. What are partitions?
2. Give examples of partitions we have encountered so far.
3. Name the 3 functions we can use with dates.
4. What is the difference between TO_CHAR and TRUNC?
5. How can I extract the day of the week with TO_CHAR?
6. How can I find the first day of the year using TRUNC?
7. Limit the results of a query to only one day of data using TO_DATE.
8. What comparison operators can I use with DATES?
9. Name the 6 aggregate functions SQL supports.
10. What is the difference between MIN, MAX and COUNT on one hand, and SUM and AVG on the
other?
11. Use DISTINCT with an aggregate function.

12. What new CLAUSE do I have to use when I use aggregate functions?
13. Why do we need to use GROUP BY with aggregate functions?
14. How do I limit the results of a query using an aggregate function (e.g. sum(field) > 20)?
15. What is the order of the 6 clauses we have used so far?
16. Give a column example for each data type.
17. Use 3 comparison operators in an example.
18. What is the retention policy of a table?
19. What do you need in order to run an already written query?
20. Name all the comparison operators.

1. JOINING TABLES
By now you have learned how to write basic code, how to browse for table and column information, how to use that information, how to limit your results to just the ones you need, how to select a time frame for your data, how to make the most of dates and how to use aggregate functions.

So far we have been limited to information stored in only one table. Tables can hold over 100 distinct
columns, which is a lot, but given everything that is happening in Amazon and all the specific reporting
needs, even in 100 columns you will not find all the information you need.

What is the solution for this? You could query one table at a time, get a part of the information you need from one query and a part from another query, and then compile that information in Excel, but that will take extra time and effort. The way SQL answers this problem is by giving you the ability to join 2 or more tables.

To be more precise, let’s say I need to know the SUBCATEGORY for a set of ASINs, among other
attributes. I have found that D_MP_ASINS is the table which stores all the information I need, or almost
all. By querying D_MP_ASINS, I can get the subcategory_code, but that will not tell me too much about
what that subcategory holds. I would need a subcategory description/name, but that field is not in
D_MP_ASINS. I have an abundance of tables from which to choose.

From experience, I will choose D_DAILY_ASIN_GV_METRICS, where I know I have the column
“SUBCATEGORY_DESC” as well as “SUBCATEGORY_CODE”.

We need this output: ASIN, item_name, GL, subcategory and subcategory description.

Our query will look like this:
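The query itself is a screenshot, but based on the description that follows it could look roughly like this (the exact WHERE conditions are my assumptions):

SELECT
     dmp.asin
    ,dmp.item_name
    ,dmp.gl_product_group
    ,dmp.subcategory_code
    ,gv.subcategory_desc
FROM d_mp_asins dmp
JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = dmp.asin
 AND gv.marketplace_id = dmp.marketplace_id
WHERE dmp.region_id = 2
  AND dmp.marketplace_id = 3
  AND gv.region_id = 2
  AND gv.snapshot_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')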

There is a lot of new information in here, so let’s try and split this
query up into chunks. Starting with the SELECT you will notice
that every column I pull has a prefix and that same prefix is
placed after the table name, in the FROM clause.

When you join 2 tables, SQL has no way of knowing from which table it should take the information in your SELECT clause. What this query says is: get the ASIN, the ITEM_NAME, the GL_PRODUCT_GROUP and the SUBCATEGORY_CODE from the table D_MP_ASINS (notice that the column prefix is 'dmp' and the table is renamed, or aliased, dmp), and the SUBCATEGORY_DESC from D_DAILY_ASIN_GV_METRICS (with the column prefix being 'gv' and the table being renamed the same way).

'dmp' and 'gv' are called TABLE ALIASES. So basically, what you are doing is renaming the table. This is not mandatory; you can use the full table name as a prefix (just use a way in which you tell SQL what table to use for each column), and it would look like this:

It's not as practical as using table aliases, but it will do.

Now that I have explained the TABLE ALIAS, let's move to the FROM clause. After 'From d_mp_asins dmp' you see another table referenced, with the keyword 'join' placed in front of it. This is the syntax for using multiple tables in the same query; you must reference all tables. There is a main table (d_mp_asins) and a secondary table (d_daily_asin_gv_metrics). We'll cover the 'join' keywords a bit later; for now, let's continue with the code.

After mentioning the 2 tables you have to specify to SQL how to join them and what the common ground
is. You are asking SQL to get you info from 2 tables for the same set of data. But SQL does not know that
it’s the same set of data. It does not know that the ASIN column in D_MP_ASINS has the same
information as the ASIN column in D_DAILY_ASIN_GV_METRICS, so you have to specify it.

Whenever you join multiple tables, you have to use the partitions each table has. How are the 2 tables
partitioned? D_MP_ASINS is partitioned by MARKETPLACE_ID and REGION_ID,
D_DAILY_ASIN_GV_METRICS is partitioned by REGION_ID and SNAPSHOT_DAY (This is information you
can easily find in BI-Metadata as explained in chapter II).

Above you can see how all partitions are used. The 2 tables are not joined on region_id, so is this partition taken into consideration? While the 2 tables are not joined on region_id, this join is implicit through the marketplace_id join: marketplace_id 3 is one of the 5 marketplace_IDs in the EU region (region_id = 2) and marketplace_id is used in the WHERE clause.

JOIN SYNTAX
The syntax described above is what you'll see in most SQL queries. However, this is the newer syntax; before it, the tables were separated by commas in the FROM clause, while the join criteria (the columns on which the tables were joined) were mentioned in the WHERE clause.

The results are the same, but the syntax used in the first example is more straightforward and easier to use. There is also the added advantage that all the join criteria are in the same place as the tables, in the FROM clause.
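For comparison, a sketch of the same join written in the older comma syntax (same assumed conditions as the sketch above):

SELECT
     dmp.asin
    ,gv.subcategory_desc
FROM d_mp_asins dmp, d_daily_asin_gv_metrics gv  -- tables separated by commas
WHERE gv.asin = dmp.asin                         -- join criteria now live in the WHERE clause
  AND gv.marketplace_id = dmp.marketplace_id
  AND dmp.region_id = 2
  AND dmp.marketplace_id = 3
  AND gv.region_id = 2
  AND gv.snapshot_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')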

JOIN TYPES
There are more ways of joining two tables, each with its own advantages and each being more suited to
particular queries.

a. INNER JOIN or JOIN

The join used in the examples above is the default join, also known as an inner join (you might even see in some queries 'inner join' instead of just 'join' being used, but it's the same thing).

This is a Venn-Euler diagram. The A ellipse is table D_MP_ASINS. The B ellipse is table D_DAILY_ASIN_GV_METRICS. What the INNER JOIN does is take only that which is common to both tables and place it in your output.

In our query, we are getting the ASIN, item name, subcategory_code and GL from D_MP_ASINS and the subcategory_desc from D_DAILY_ASIN_GV_METRICS. Let's say that in D_MP_ASINS, within the subcategory '14700720', I have the following ASINs: B000000000, B000000001, B000000002, B000000003 and B000000004. However, in D_DAILY_ASIN_GV_METRICS, the ASINs present in the table for the SNAPSHOT_DAY I used are B000000000, B000000001 and B000000002 (I want to emphasize again the specifics of each table: D_DAILY_ASIN_GV_METRICS will, by the purpose for which it was created, only store the ASINs which have glance views on a given date).

Because I have used an INNER JOIN, and I am thus selecting only what the 2 tables have in common, my export is only going to have data for ASINs B000000000, B000000001 and B000000002.

This is the clearest depiction of what the INNER JOIN does, and the results would look like this:

ASIN        Item_name    Gl_product_group  Subcategory_code  Subcategory_desc
B000000000  Item name 1  147               14700720          Internal hard drives
B000000001  Item name 2  147               14700720          Internal hard drives
B000000002  Item name 3  147               14700720          Internal hard drives

b. OUTER JOINS

The INNER JOIN takes whatever is in the middle of the two tables, the green section above. The outer joins take the other sections. What the outer joins are good for is that they will not limit the results of the query to only what is common between the queried tables.

- LEFT JOIN

Going back to the diagram, the left join will get all the results in A (D_MP_ASINS) + whatever matches from B (D_DAILY_ASIN_GV_METRICS). In our case, it means I will see even the ASINs which are not present in D_DAILY_ASIN_GV_METRICS (they have no glance views for the chosen SNAPSHOT_DAY).

The syntax is almost the same, with the exception that 'LEFT' will appear in front of the JOIN keyword.

In some queries, instead of LEFT JOIN you might see LEFT OUTER JOIN. Again, it is the same thing as with JOIN and INNER JOIN, don't mind it.
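A sketch of the left-joined version; note that I moved the gv filters into the join condition, because (as the WHERE CLAUSE WITH JOINS section below explains) leaving them in the WHERE clause would quietly turn this back into an inner join:

SELECT
     dmp.asin
    ,dmp.item_name
    ,dmp.gl_product_group
    ,dmp.subcategory_code
    ,gv.subcategory_desc
FROM d_mp_asins dmp
LEFT JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = dmp.asin
 AND gv.marketplace_id = dmp.marketplace_id
 AND gv.region_id = 2
 AND gv.snapshot_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')
WHERE dmp.region_id = 2
  AND dmp.marketplace_id = 3

The results would look like this: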

ASIN        Item_name    Gl_product_group  Subcategory_code  Subcategory_desc
B000000000  Item name 1  147               14700720          Internal hard drives
B000000001  Item name 2  147               14700720          Internal hard drives
B000000002  Item name 3  147               14700720          Internal hard drives
B000000003  Item name 4  147               14700720
B000000004  Item name 5  147               14700720

Notice that for ASINs B000000003 and B000000004, there is no Subcategory_desc information. That is
because, as we have previously seen, those 2 ASINs do not appear in D_DAILY_ASIN_GV_METRICS for the
queried SNAPSHOT_DAY.

- RIGHT JOIN

Being an outer join, the RIGHT JOIN does the same thing as the LEFT JOIN, only that the main set of data is pulled from the table on which you RIGHT JOIN. It's basically a reverse LEFT JOIN. To illustrate this, let's say that apart from the 3 common ASINs between D_MP_ASINS and D_DAILY_ASIN_GV_METRICS (B000000000, B000000001, B000000002), the latter table also has ASINs B000000007, B000000008 and B000000009, for which there is no record in D_MP_ASINS.

The syntax is the same as for the LEFT JOIN, just write ‘RIGHT JOIN’ instead.

Given the data above, this is what our export would look like.

ASIN        Item_name    Gl_product_group  Subcategory_code  Subcategory_desc
B000000000  Item name 1  147               14700720          Internal hard drives
B000000001  Item name 2  147               14700720          Internal hard drives
B000000002  Item name 3  147               14700720          Internal hard drives

These results look exactly like the INNER JOIN results. Why is that?

Let's analyze the syntax closely. You are asking for the subcategory_code from D_MP_ASINS (dmp.subcategory_code). Regardless of the join type, you are limiting the results of this query to whatever records are in D_MP_ASINS. How can this be avoided? You either SELECT the data FROM D_DAILY_ASIN_GV_METRICS first and then LEFT JOIN it to D_MP_ASINS, or, better yet, you keep the same RIGHT JOIN condition you have now, but you make the subcategory_code one of the join criteria.

By joining on the subcategory_code as well, you practically put all the subcategory code instances from both tables into the same bucket. With this modification made, the results of the query will now look like you expected the first time around.
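A sketch of that corrected right join; pulling the ASIN and subcategory columns from the gv side (so the unmatched rows have something to show) is my assumption, since the original screenshot is not reproduced:

SELECT
     gv.asin
    ,dmp.item_name
    ,dmp.gl_product_group
    ,gv.subcategory_code
    ,gv.subcategory_desc
FROM d_mp_asins dmp
RIGHT JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = dmp.asin
 AND gv.marketplace_id = dmp.marketplace_id
 AND gv.subcategory_code = dmp.subcategory_code  -- the extra join criterion
WHERE gv.region_id = 2
  AND gv.marketplace_id = 3
  AND gv.snapshot_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')

And the results now look like this: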

ASIN        Item_name    Gl_product_group  Subcategory_code  Subcategory_desc
B000000000  Item name 1  147               14700720          Internal hard drives
B000000001  Item name 2  147               14700720          Internal hard drives
B000000002  Item name 3  147               14700720          Internal hard drives
B000000007                                 14700720          Internal hard drives
B000000008                                 14700720          Internal hard drives
B000000009                                 14700720          Internal hard drives

This is a good example of what joining on the right columns can mean to your query results.

- FULL OUTER JOIN

There is a third type of outer join, the FULL OUTER JOIN. Think of this join as the opposite of the INNER JOIN. If the INNER JOIN extracts information only for the common ASINs, in our case, the FULL OUTER JOIN extracts all information from the two (or more) tables. If I were to full outer join the above tables on the ASIN column, the result would be an extract which contains all ASINs in table d_mp_asins + all ASINs in table d_daily_asin_gv_metrics. This type of join is an extreme measure; I have not found an issue that required this join type so far, but it might be useful one day.
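For completeness, a minimal sketch (NVL just picks whichever side's ASIN is not blank; all filters omitted to keep the outer behavior visible):

SELECT
     NVL(dmp.asin, gv.asin) AS asin  -- take the ASIN from whichever table has it
    ,dmp.item_name
    ,gv.subcategory_desc
FROM d_mp_asins dmp
FULL OUTER JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = dmp.asin
 AND gv.marketplace_id = dmp.marketplace_id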

WHERE CLAUSE WITH JOINS

The outer join examples from above are not exactly outer joins. A pure outer join would look like this:

Notice how the WHERE condition which was referring to a D_DAILY_ASIN_GV_METRICS field (gv.snapshot_day) has been commented out. With that change I am taking all results from D_MP_ASINS and everything that matches from the other table. Before, I was still limiting the results with the snapshot_day WHERE clause condition. Of course, this code would take ages to run, as it would be scanning thousands of snapshot_day partitions and billions of rows to get my results.

Always be very careful of what you add to your WHERE clause, because it might turn the OUTER JOIN into a plain INNER JOIN!

JOINING 3 OR MORE TABLES

Joining 2 tables is pretty straightforward; joining 3 or more tables might turn out to be a bit of a challenge.

This is probably going to be the most common use case for multiple table joins. For this query I needed an attribute from each of the tables in the FROM clause. There was not any one table to hold all the information I needed, so I just LEFT joined all the other tables on D_MP_ASINS to get everything I needed.

I am still relying on D_MP_ASINS to provide the main bits of information in this query.
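The original query is a screenshot, so as a stand-in, a sketch of the shape (the joined tables are ones used earlier in this document; which attributes you pull depends on your needs):

SELECT
     dmp.asin
    ,dmp.item_name
    ,gv.glance_view_count
    ,act.net_ordered_units
FROM d_mp_asins dmp
LEFT JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = dmp.asin
 AND gv.marketplace_id = dmp.marketplace_id
 AND gv.snapshot_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')
LEFT JOIN d_daily_asin_activity act
  ON act.asin = dmp.asin
 AND act.marketplace_id = dmp.marketplace_id
 AND act.activity_day = TO_DATE('2015/12/11', 'YYYY/MM/DD')
WHERE dmp.region_id = 2
  AND dmp.marketplace_id = 3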

As your data needs become more complex, you will find yourself in situations where not even 2 or 3 tables contain the information you need. Even though in the example above joining 4 tables together looks easy, in fact you may want to pay special attention to how each table is built, how it is partitioned, what columns are nullable etc.

V Subqueries, wildcards, segments

1. SUBQUERIES
Subqueries are SQL statements nested inside other SQL statements, like a query within a query. Think of a subquery as a temp table you are creating, a table whose results are discarded as soon as your query is done running. When we take a look at how to optimize a query, you'll see a very efficient way of using subqueries which will do wonders for the runtime of your SQL code.

Subqueries can be used in the FROM clause (which I recommend the most) as a table to JOIN on, or in the WHERE clause to limit your results (which is very cost inefficient).

How do you read this query?

This query is getting all the ASINs for which the "IS_VERY_HIGH_VALUE" flag is set to YES ('Y'), and it pulls the MARKETPLACE_ID, GL and the GLANCE VIEWS for that list.

What is the difference between the subquery and join in the query on the left and the following JOIN?
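Both queries are screenshots, so here is a hedged sketch of the contrast; the table carrying the flag and the filters are my assumptions, only the join shape matters:

-- Variant 1: the subquery returns a bare ASIN list, joined on ASIN only
SELECT gv.marketplace_id, gv.asin, gv.glance_view_count
FROM (
      SELECT DISTINCT asin
      FROM d_mp_asins                 -- assumption: the flag lives here
      WHERE region_id = 2
        AND is_very_high_value = 'Y'
     ) vhv
JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = vhv.asin

-- Variant 2: the flag is checked per marketplace, so the join carries marketplace_id too
SELECT gv.marketplace_id, gv.asin, gv.glance_view_count
FROM (
      SELECT DISTINCT asin, marketplace_id
      FROM d_mp_asins
      WHERE region_id = 2
        AND is_very_high_value = 'Y'
     ) vhv
JOIN d_daily_asin_gv_metrics gv
  ON gv.asin = vhv.asin
 AND gv.marketplace_id = vhv.marketplace_id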

The way the query on the left is built, it gets a list of all ASINs in EU for which this flag is set (regardless of MARKETPLACE info, just an ASIN list) and then it gets the GVs, MP and GL info for each and every one. This means that if ASIN B000000000 has the "IS_VERY_HIGH_VALUE" flag set to YES in DE and FR, but not in UK/IT/ES, I am still getting information in those locales as well. By using the code on the right I am limiting my results to just the cases in which the flag is set at MP level.

The relevance of this example is that I know the value for this attribute should be the same in all EU countries, but I also know that there are data quality issues with DW; this is a hack to get around one such possible IDQ issue.

As I have mentioned, I can use that subquery to limit my results by placing the subquery above in the
WHERE clause.
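A sketch of that WHERE clause placement (same assumed flag table as above):

SELECT gv.marketplace_id, gv.asin, gv.glance_view_count
FROM d_daily_asin_gv_metrics gv
WHERE gv.region_id = 2
  AND gv.asin IN (
        SELECT asin
        FROM d_mp_asins              -- assumption, as above
        WHERE region_id = 2
          AND is_very_high_value = 'Y'
      )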

A subquery runs basically like a query, but it stores its results in a temp space, creating a new table in the
process which will be discarded once the query is done running.

It is a lot more efficient to use subqueries when you have multiple tables you need data from. Subqueries are the first step towards using re-usable code in your queries. Let's face it, with most queries you'll build, you are not going to re-invent the wheel. What's more, you may find you have a common denominator between all the queries you build, meaning you will have a block of information you will always pull. In my case, almost every query I build has glance view information, because that is a good indicator of how popular an ASIN is, and it also allows me to prioritize which ASINs to fix in case there is an issue somewhere, or which ASINs to deep dive on in case of a wider problem.

Now, whenever I need glance view information in my query, I just have to copy and paste a subquery
instead of writing all that code in my query and figuring out on which tables to join, how to join etc.

2. SEGMENTS
Up until now we have always queried either for a small group of ASINs, limiting our results to only the ASINs in the WHERE clause, or without giving the query any particular ASIN, instead giving it a general field to search, such as all the ASINs in MP Spain which belong to GL Camera and have an Amazon offer.

But what if you have to get some information for 20,000 ASINs at once? You basically have 2 options:

a. Insert them manually in the SQL statement (this is called 'hard coding'). This may work for 4,000 or maybe even 20,000 ASINs (depending on the complexity and the efficiency of your query) but it will not scale for 100k ASINs.
b. Load the ASINs in DW and reference them in your query through a segment.

The way you do this is by loading a segment in DW.

1. Create a text file containing your list of ASINs, one per line, with no header row. Make sure your ASINs didn't lose any leading zeros and are all ten characters long. Save this file to a location you can remember, such as your desktop.
2. Go to www.datanet.amazon.com, on the ETL Manager tab. Then click on Segment Creator on the left. This will take you to another page where you will have the following view and information:

3. From the dropdown list at Step 1 (**Select Segment Type) choose Product Segment.
4. Another dropdown will appear under Product Segment asking you to select a Legal Entity ID. The Legal Entity ID is another indicator for the marketplace. For more information check this wiki.
5. Enter a description. When adding a description, apply the same logic you're applying when naming your queries/profiles. You might end up having a lot of segments, and being able to identify each one fast is going to be a life saver.
6. Upload the .txt list you prepared at point 1 above.
7. Click Submit.
8. Now you will be taken to a page which looks like this, only that it's 6 times bigger:

Basically, your segment is now loading on every data cluster. Once the segment is done loading on a data cluster, you will no longer see it in that view. But you will be able to see it in My Segments.

Click on My Segments and it will take you to an overview of all the segments you have ever created.

This overview will give you some basic information like the type of the segment (the first thing you selected in the Segment Creator menu), the number of records in it, the description you added, the creation date etc. However, the column which holds the information you want is "Segment ID". The code in that column is the one you will reference in your query.

How do I reference my Segment in a query?

By using a subquery, of course. It’s not the only option but it’s the easiest. All the segments you create
are stored in a DW table which you can then query. For ASINs that table will be
PRODUCT_SEGMENT_MEMBERSHIP.

This is how your subquery should look, and you can use this as your ASIN source in whatever query you build.

I have joined on the subquery the same way I would join on any table.
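The screenshot is not reproduced, so here is a hedged sketch (the membership column name and the segment ID are placeholders; check the table's actual columns in BI-Metadata):

SELECT dmp.asin, dmp.item_name
FROM (
      SELECT DISTINCT asin           -- placeholder: use the member column your segment type stores
      FROM product_segment_membership
      WHERE segment_id = 123456      -- the Segment ID from the My Segments page
     ) seg
JOIN d_mp_asins dmp
  ON dmp.asin = seg.asin
WHERE dmp.region_id = 2
  AND dmp.marketplace_id = 3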

Why should I use segments?

First off, you can load more than just lists of ASINs in a segment. Depending on what you need, you can load EANs, UPCs, vendor codes, customer IDs, postal codes, merchant IDs and so on in a segment. If you have a list of elements which you know is a constant (same list of vendor codes, same list of ASINs), using a segment and referencing them that way is a lot more convenient than hard-coding the values into your query. Hard coding will not only increase the strain on the query, slowing it down, but it will also make it hard for you or any other user to read and understand your code, given that you will have to sort through countless rows of endless ASINs.

3. WILDCARDS
We'll now get to see some of that extra ETL functionality I talked about in the previous courses. By using wildcards you can reference certain information in your query without hard-coding it. Think of the wildcard functionality as being similar to that of a segment.

The RUNDATE wildcard

Arguably the most useful wildcard, the RUNDATE wildcard is used to reference the date for which you are running the query. For example:

Notice how there is no actual date referenced in the query. In other words, the date is not hard-coded. So what does run date mean? Where is ETL getting the info on the day for which the query should run?

Well, you are already setting that up when you are running a profile job, remember?

The date you select here is the same date which is referenced in the query above. Imagine you have a daily report for which you need data. Normally, you'd need to change your code every day and update it with the latest dates. The run date wildcard makes this task unnecessary, as every day your query runs it already looks to the run date for information on the time frame.

If you need reports for the past 2 weeks/month/quarter and so on, the run date wildcard can also help you out.

Basically, all the tricks we learned for DATE manipulation can also be used with the run date wildcard. Keep in mind that, just like the rest of the wildcards, run date is an ETL function, so it is not supported by SQL Developer.
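As a sketch of the pattern (this exact tag shows up in the subselects in the Tips and tricks chapter; the 14 day window is just an example):

-- pulls a rolling two-week window ending on the profile's run date
WHERE snapshot_day BETWEEN TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD') - 14
                       AND TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')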

The MARKETPLACE, REGION and LEGAL ENTITY wildcards


{MARKETPLACE_ID}, {REGION_ID} and {LEGAL_ENTITY_ID} do pretty much the same thing and work in the same way. Just as with the {RUN_DATE} wildcard, whenever you reference one of the 3 wildcards in question, the SQL code will look to where you have mentioned the values for those wildcards.

And here is where you set your wildcard data:

Unfortunately, you can't use all 3 partition types at once, so you have to choose between the 3. But this is not actually that big of an issue, as marketplace_id and legal_entity_id mean pretty much the same thing; it's just a matter of which of the columns your table contains. As for REGION_ID, as long as you have marketplace_id you can use the EU values to mimic partitioning by REGION_ID.

This is where you can view the partition information you chose. Look at the example on the left and notice that you are not limited to just one value. In this case I chose the UK, DE and FR marketplaces.

The FREE_FORM wildcard


This is another very useful wildcard. Basically, in FREE FORM you can write anything, because it's free form. Think of it as a cell in an Excel file. For example, if I want to query for a list of GLs, I can either hard code them in the query, as we've been doing until now, or write them in the free form field and then reference the free form in the query.

You can write the information you want in the free form in the job profile.

Be very careful of what type of data you use in your free form tag. ETL reads the free form tag just as if the information you're referencing was hard coded into the query, meaning if you had CHAR or VARCHAR2 information you'd need to enclose those bits of information in single quotes. A few examples are vendor codes, brand codes, ASINs etc.

One more thing worth mentioning is that using the free form and writing it in the query in upper case (e.g. ASIN in ({FREE_FORM})) means that whatever info you have in your free form tag will also be converted to upper case.

Take the example of a brand name: if I say AND brand_name like '{FREE_FORM}', then my query will have no results. That's because it will always try to match NIKE (written like this) to Nike (the way the brand name is written in DW), and it will not match because the 2 values are not the same. If you don't take this into consideration you can expect a few blank row queries.

Using FREE_FORM with segments

Another great use for the FREE_FORM tag is to reference segments in your queries. It is not that much of a headache to modify a query to change a segment ID, but by using the FREE_FORM tag you can avoid editing the query entirely; you'll only have to change the FREE_FORM tag in the job profile and that will be it. You'll end up with something like this: AND SEGMENT_ID in {FREE_FORM}, which is a lot easier to manage.

4. PRACTICE
1. Load the following ASINs into a DW segment.

2. Build a query which, for the segment you have created, extracts the glance views, the
marketplace, the GL, the item name and the brand name – use as many tables as you have to.
3. Using the {RUN_DATE} wildcard, make your query run for the last calendar month.
4. Use the SEGMENT_ID in a subquery.
5. Reference the SEGMENT_ID using the FREE_FORM tag.
6. Limit your results to marketplaces 3, 4 and 5. Do this by using the {MARKETPLACE_ID} wildcard.
7. Extract the above information only for ASINs having over 10 glance views for the given time
frame.
8. Rank the ASINs by glance views so you get the top ASINs at the top of your output.

VI Publishers, Decode, Case when, Lower, Upper, Partition by, Coalesce

We've covered basic query writing so far. You've learned how to search for data, how to identify the DW tables you need, how to query for the data you want, how to use information from multiple DW sources, how to aggregate information, how to use subqueries, and how to use ETL functionalities like segments, wildcards, job scheduling, and notifications on query error or success. You have all the basic knowledge to write simple to advanced ETL queries, but there are still a lot of things which can make your life a lot easier when writing code.

1. PUBLISHERS
The publisher of a query is another functionality ETL offers. Setting up a publisher allows DW to send you the results of a query by email or publish them on a share drive. Why is publishing the results of a query a big deal? First off, it saves a lot of time you'd otherwise lose manually downloading the results (having the results sent directly to you via email is very convenient). It becomes especially useful when the results of your query are really big (exceeding 30 MB) and manually downloading them can take a while. Plus, having the results automatically published to a location is the first step towards automating processes/reports.

Go to DW at the query page and look to the Publisher section. Click New and this will open a drop menu asking for the type of publisher you want.

There are more options you can choose from when publishing. The ones you will be using the most are the Email publisher, the Windows Share publisher and the Share Point publisher.

The Email publisher is the most intuitive one to use.

Just choose email from the drop down above, which will automatically bring you to the window on the left. Here you can write a description, or not, it's not mandatory. Leave the encoding as is, but set the format to Text. Input the email address in the email addresses box; additionally, you can also choose to have the results sent as a Zip attachment, which is a good idea, else you will get the entire output in the body of an email.

Setting up a publisher on a share drive.

This is a bit trickier, as it requires more than just writing an email address. The Description is the same as above, you should still choose Text as format and leave the Encoding unchanged, but you'll have to input the address of the file. Be very careful when doing this, as adding an extra space, even at the end of the address, will result in DW not recognizing the path and giving you an error when publishing.

Also, DW will create the file you want it to, but it will not create the path (the folders leading to that file). You also need to be very careful and remember to set an extension for the file you're setting for the publisher. E.g. \\ant\dept-eu\IAS2\RBS-RO-IAS2-NIS\EFN Flex Tasks\Daily EFN Uploads\EFN Removal\New.txt.

I've done all the steps above and still I get an error when publishing, why is that?

There is one more thing you need to be sure of: that DW has the permissions to write files at the address you chose. In the folder you want to publish the file to, right click and go to Properties → Security → Advanced. In the window which pops up you'll see all users authorized to publish content.

If you see the user DW-Metron or D. W. Metron, that's good news, it means Data Warehouse has rights to publish to your location. This will not always be the case, so you may have to add the user yourself (if you have permissions to do that) or contact your local IT department to add the user for you. In case you can do it yourself, here's where you need to go.

Click Change Permissions → Add… → Advanced and in the name box search for the user DW-Metron → OK, and give DW-Metron the following permissions:

Click OK → OK → OK → OK and that's it.

Now, going precisely to the folder you want to publish in may be a bit unnecessary. Instead, it's better to identify a root folder in which you'll want to publish information and just add the user DW-Metron to that folder, while selecting the option "This folder, subfolders and files" from the "Apply to" drop list.

For example, if my publisher address is \\ant\dept-eu\IAS2\RBS-RO-IAS2-NIS\Main Folder\Secondary Level\Third Level\New.txt, I could go into the folder Third Level and give DW-Metron permissions to write information there. But this means that for every new folder in the Secondary Level folder I'd need to add the permissions to DW again, and again, and again. So in this case it's easier to give DW permissions to write into the Main Folder with the option to also publish in whatever subfolders and files you may find there.

In case you are not authorized to add DW-Metron yourself please reach out to your local IT department
for guidance.

2. DECODE
The DECODE function allows you to recode the output data of a field into whatever you desire. By using DECODE, you are basically changing some (or all) values which can be found in one column. An example of that is recoding the MARKETPLACE_ID codes into easier-to-understand tags.

In this table I have a field named FLIP_TO_IOG, which I know represents the IOG of the country where some inventory is stowed. I know what those numbers mean, but to anyone else it's not much to go by.

Look at the column renamed MP. I am telling SQL that whenever the value of that column is 7, it should replace it with UK, when it's 8, replace that with DE, and so on. Below is the output.

So, I knew all the values in the column FLIP_TO_IOG, and I could recode them all. The syntax is not that hard; it goes like this: decode, followed by a left parenthesis, the column whose values I want to decode, comma, the first value, comma, the value I want in its place, the second value, comma, the value I want in its place, etc., and at the end close the parenthesis. Note that I don't have to input the values alphabetically, or in order; the DECODE function is not dependent on that.
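A sketch of those two decodes; the IOG-to-country mapping is the one used in the inventory subselect in the Tips and tricks chapter, while the source table name is a placeholder:

SELECT
     flip_to_iog
    ,DECODE(flip_to_iog, 7, 'UK', 8, 'DE', 9, 'FR', 75, 'IT', 85, 'ES') AS mp
    ,DECODE(flip_to_iog, 7, 'YES', 'NO') AS is_uk  -- odd argument count: the last value is the default
FROM my_inventory_table  -- placeholder table name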

But what about the second decode, aliased IS_UK? It has an odd number of arguments, how can it work?
Well, that decode is working a bit differently because it tells SQL to convert all the ‘7’ values it finds into
the value ‘YES’, and convert everything else to the value ‘NO’.

DECODE is a very useful function and it does a bit more than I have showed you here. Sure, it's good Cx to have appropriate data tags in your reports, and DECODE does in SQL what FIND AND REPLACE does in Excel, without the extra work. In the Tips and Tricks session you'll see another use for this function.

3. CASE WHEN
Think of CASE WHEN as the big brother of DECODE. It is the SQL equivalent of the IF function in Excel, with the added value that it is a lot easier to write than IF and it scales a lot better once it gets more complex.

Let’s look at the CASE WHEN syntax before going into the examples. First off, notice that you can have
one or multiple conditions in a CASE WHEN statement, you can reference one or more columns, and,
based on the needs of your query, you can generate one or more output values.

Let's look at the first CASE WHEN statement (CASE WHEN FLIP_TO_IOG = 7 then 'YES' else 'NO' end IS_IN_UK). This statement does exactly what the decode above did, with no difference. It is just expressed differently, but the outcome is the same.

Notice that the beginning of the statement is marked by the CASE WHEN keyword and the end is marked by 'end'. Always remember to place 'end'; otherwise, you will not be able to run your code.

The second CASE WHEN is already a bit more complex. It is giving a binary output ('FR inv in DE' or BLANK), just as the line of code above it did, but it is looking at multiple columns. Whenever the 3 conditions I placed in this CASE WHEN are met, the query will return the value I want.

The third CASE WHEN is the coolest, because it looks, again, at more than one column, and it gives me a more comprehensive feedback of what it finds. What I did was to basically decode the output of the conditions I entered.
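Since the screenshot isn't reproduced, a hedged sketch of the multi-column shapes (the columns in the second and third statements are hypothetical, chosen only to show the pattern):

SELECT
     flip_to_iog
    ,CASE WHEN flip_to_iog = 7 THEN 'YES' ELSE 'NO' END AS is_in_uk
    -- multiple conditions, binary output (a value or blank):
    ,CASE WHEN source_iog = 9            -- hypothetical column
           AND flip_to_iog = 8
           AND on_hand_quantity > 0      -- hypothetical column
          THEN 'FR inv in DE'
     END AS fr_inv_in_de
    -- multiple conditions, multiple outputs:
    ,CASE WHEN source_iog = 9 AND flip_to_iog = 8 THEN 'FR inv in DE'
          WHEN source_iog = 7 AND flip_to_iog = 8 THEN 'UK inv in DE'
          ELSE 'other flow'
     END AS flow_description
FROM my_inventory_table  -- placeholder table name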

CASE WHEN is a complex function and can grow quite big depending on what information you want to base its output on. It will eat up more of the system resources and it will make your query heavier, so keep that in mind. A heavy CASE WHEN could be the tipping point of an already complex and resource demanding query.

4. LOWER AND UPPER

Lower and Upper are two SQL functions which convert the content of a column to either lowercase or uppercase. DW stores thousands of tables, some public, some private (stored on virts); the point is, there isn't always a standardized way of writing the information which feeds a column. Take for example the item_name: in some tables it can be stored in uppercase, in some you can find it in lowercase, and in some each word of the item_name can begin with a capital letter.

When you limit the results of a query to just one value the column can have (e.g. brand_name = 'nike'), if the value you want to match isn't written in exactly the same way in the table, the filter will not work and it will return 0 rows.

The syntax is pretty easy to use: you just enclose the field you want to convert to lower case or upper case in parentheses, and write the keyword 'lower' or 'upper' before the field name you're converting, like in the example on the left.

If, for the output below, I had used this condition in the WHERE clause – and brand_name = 'bose' – I would have gotten 0 rows in my output, because I was telling SQL to match 'bose' to 'Bose'. Think of this as a Vlookup formula where you are looking for an exact match.

It's even easier to use lower and upper along with pattern matching characters. In the output above I see that the brand name is Bose, but from experience, multiple vendors/sellers can submit a different brand name for what is essentially the same brand. My brand could have been named in DW as 'Bose electronics' or 'Bose Audio' etcetera, so even if I had gotten the lower/upper case right, I still would have found nothing or I would have missed some records. But if I phrase my filter as and lower(brand_name) like 'bose%', then I increase the odds of finding everything I want.
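A sketch of that filter in a full statement (table and columns from earlier chapters; the marketplace filter is an assumption):

SELECT
     asin
    ,brand_name
FROM d_mp_asins
WHERE region_id = 2
  AND marketplace_id = 3
  AND lower(brand_name) LIKE 'bose%'  -- catches 'Bose', 'BOSE Audio', 'bose electronics' etc.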

5. PARTITION BY and MEDIAN function

Remember we talked about the aggregate functions? Basically, what they do is aggregate (compile) the results by whatever criteria you choose. To do that, you had to use the GROUP BY clause, which has its limits.

PARTITION BY is a function which allows you to get aggregated results without being obligated to use GROUP BY. For example, I can write the following statement in 2 ways and they both work the same:

There is no difference between the 2 versions, so why use PARTITION BY at all? PARTITION BY is an example of SQL functionality which you can use to further process your results. You basically trick SQL into doing something that it won't do using normal functions (GROUP BY).

Basically, I have grouped the GVs by only 2 parameters instead of everything I had in my SELECT statement, and I also have a numbered count (aliased 'count1'). I'll share some examples of how PARTITION BY helps you in the tips and tricks portion of this course.
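A sketch of the two shapes side by side (GV table as in earlier chapters):

-- GROUP BY version: one row per marketplace + ASIN
SELECT marketplace_id, asin, SUM(glance_view_count) AS gvs
FROM d_daily_asin_gv_metrics
WHERE region_id = 2
  AND snapshot_day = TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
GROUP BY marketplace_id, asin

-- PARTITION BY version: the same totals, computed on every row, no GROUP BY needed
SELECT DISTINCT
     marketplace_id
    ,asin
    ,SUM(glance_view_count) OVER (PARTITION BY marketplace_id, asin) AS gvs
FROM d_daily_asin_gv_metrics
WHERE region_id = 2
  AND snapshot_day = TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
-- a numbered count such as 'count1' would be ROW_NUMBER() OVER (PARTITION BY ...) in the same spot,
-- though that makes every row unique, so you would then drop the DISTINCT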

One case where it helps you is with the MEDIAN aggregate function. First off, what is the median? It's the number separating the higher half of a data sample, a population, or a probability distribution from the lower half. When is it best to use the median? When you have an irregular distribution of values which makes the average less relevant. For example, if you had a list of 12 vendor costs for the same ASIN, where 11 would be around the value of 7$ while the 12th would be equal to 50$, calculating an average sourcing price here would not be an accurate depiction of reality, as Amazon would not choose the most costly option when it has 11 cheaper options. If you were to present a report showing the average sourcing cost, you'd see that you can get that product for an average of 10.58$, when in reality Amazon got it for around 7$. In this case, the median removes the distortion that bad data can introduce.

Where in the case above the average cost is 10.58$, the median cost is 7$, which is also a lot closer to reality.

You can see the differences more easily here. The query above pulls the average GVs at EU level, the median at EU level and the median at MP level. Check the results below. Now I have brought all of this into one view without using extra Excel functions. What if instead of glance views you had sales, or GMS, or CP?

PARTITION BY is essentially a way of tricking SQL into providing you with the output you need.
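A sketch of that three-column comparison (Oracle's MEDIAN also works as an analytic function with a PARTITION BY clause; columns as in earlier chapters):

SELECT DISTINCT
     asin
    ,marketplace_id
    ,AVG(glance_view_count)    OVER (PARTITION BY asin)                 AS avg_gv_eu
    ,MEDIAN(glance_view_count) OVER (PARTITION BY asin)                 AS median_gv_eu
    ,MEDIAN(glance_view_count) OVER (PARTITION BY asin, marketplace_id) AS median_gv_mp
FROM d_daily_asin_gv_metrics
WHERE region_id = 2
  AND snapshot_day = TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')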

6. COALESCE
This SQL function fuses 2 or more column outputs into one single output. That is what concatenating does, right? So why do we need another function for this? While concatenating will bring together the information of several columns plus whatever other elements you desire, COALESCE does something else: it takes a number of columns (2 or more) and returns the value of the first column which is not blank.

Let me illustrate this with an example so it becomes clearer. The table D_MP_ASINS has 3 columns which basically hold the same information, the brand name. Those fields are: publisher studio label, brand name and merchant brand name. None of these fields is null proof, and it's generally good to know what the brand of the product you're querying is.
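A sketch of the fallback (the exact column names for the three fields are assumptions based on the descriptions above):

SELECT
     asin
    ,COALESCE(publisher_studio_label, brand_name, merchant_brand_name) AS brand  -- first non-blank wins
FROM d_mp_asins
WHERE region_id = 2
  AND marketplace_id = 3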

The results of that query look like this. Notice that the field 'brand_name' is blank; had I relied solely on it, I would have missed some important information.

VII Tips and tricks

Working with ETL for a while will bring you to some realizations about what works, what doesn't, how to build a query so it scales better, etcetera. You can learn a lot from reading other people's queries and deconstructing them, seeing how each table is partitioned and how it is joined; generally it's a good exercise to expand your knowledge. Below you will find some easy-to-use tricks I found while trying to add more functionality to my queries or make them scale.

1. HOW TO GET FROM A TO B WHEN WRITING A QUERY


When you’re writing a query, you generally have a clear goal in mind. Whether the goal is reporting or
simplifying a task, the process to write that query is the same.

You'll need to watch out for a few things and go through several steps. The steps below are my process; it's what I do, it's what I found works for me and makes my work more efficient. However, what works for some doesn't necessarily work for others, so take the steps below with a pinch of salt.

a. What am I trying to accomplish?


Am I looking to create a report? Am I looking to get a plan approved and I need data to back up
my initiative? Do I want to simplify a task? It’s important to ask these questions for scalability
reasons, for example, getting data to support an initiative is a one off. This means you don’t need
to spend as much time trying to integrate the query with analytics tools you use, you don’t have
to comply with a pre-determined output type.
b. Am I reinventing the wheel?
Creating a report from scratch, or creating data for a business case from scratch, will take time. You want to make sure you don't spend countless hours on building something from the ground up when there was already a query doing the same thing or something similar. Building a query takes time: time to design, time to write, time to test, time to amend and retest. Reach out to your peers, to your BI colleagues, look through what you have written in the past, browse through SQL Library.
c. What do I need my output to look like?
Take time and really think this through. Good planning from the very beginning means you'll save time later on adding complexity to your query and making changes because you realize there is other data you may need as well. Think about the level of detail you need (e.g. do you need ASIN level details, user level details, or can you go directly to subcategory level/team level for the results). For me it's easier to make the Excel header I need and fill it with fictitious data just to see what I could get from there.
You should not start thinking of how to build your query just yet; just focus on your output.
d. How do I make my output look the way I want it to?
In step c above I talked about designing the way your output needs to look. Now we're diving deep on how to achieve that output. This is the more challenging part but, for me at least, it was the part which helped me expand my SQL knowledge. Always push yourself to outdo your previous efforts.
A concrete example of designing an output and then finding the tech solution for it is this one.

e. What sources of error can I have with my output?
- Logical sources – conditions in my query which counter each other (e.g. region id = 2 (EU) and
marketplace id = 1 (US) )
- Technical – erroneous joins in which I lose information or which create duplicate rows, tables
which do not contain the info you expect etc.
f. How do I test the accuracy of my output?
Presenting inaccurate, irrelevant or just plain bad numbers will throw off your entire argumentation and business case. I had to find this out the hard way, and I was really lucky to be given a second chance to pull the data again.
There are several things you can do in this case to ensure maximum data accuracy:
- Analyze output versus expectations – if you're creating a new report for your team's productivity, for example, you can easily spot missing users from that report.
If no expectations are available:
- Try to refer to other existing tools which do not rely on DW. For example, you can rely on CSI to test certain attributes, you can try Alaska if your query pulls inventory levels, you can use BIW to test glance view information etc.
- Write your code in different ways, relying on different tables. As stated in the chapters above, each DW table has its own specifics, its own way of aggregating the data, its own level of granularity, and in the end even different data. Even though it would be safe to assume you'd find the same information everywhere, that's rarely the case, and most of the time not by error. If you build your query in 2 or 3 different ways and the results point in the same direction, you're safe to use that data.
- Ultimately, test out every piece of your query: break your query into smaller chunks and analyze the results of every join, the data in a subquery and how (and if) it changes when brought into the main select.

2. WITH CLAUSE
The advantages of using subqueries over normal joins have been presented in chapter V, so I'll be skipping that part. I showed you how you can add a subquery in the FROM clause or even in the WHERE clause (as ASIN in (subquery)), but there is a far more efficient way of using subqueries, which is declaring them at the beginning of your code using WITH.

The syntax is not very complicated: at the beginning of the query you start with WITH [subselect alias] AS, followed by the subselect. When you want to add another subselect, define it like the first one, just replacing 'WITH' with a comma ','. This is just like listing a set of columns in the SELECT, only you're listing a set of temp tables you are creating.
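A minimal scaffold of that structure; the 'asn' alias is the one mentioned at point d below, and the bodies are trimmed-down versions of the subselects listed at the end of this section:

WITH asn AS (
     SELECT DISTINCT asin
     FROM d_mp_asins
     WHERE region_id = 2
       AND marketplace_id = 3
)
,gv AS (
     SELECT asin, SUM(NVL(glance_view_count, 0)) AS gvs
     FROM d_daily_asin_gv_metrics
     WHERE region_id = 2
       AND snapshot_day = TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
     GROUP BY asin
)
SELECT asn.asin, gv.gvs
FROM asn
LEFT JOIN gv ON gv.asin = asn.asin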

So why do it this way and not the classic way? There are at least four reasons that come to mind.

a. The query is a lot easier to read because it is well organized. Instead of staring at a big number of rows in the FROM clause trying to pick apart the different tables and subselects used, how they're built and how they are joined, everything is well delimited: you have the tables you are querying at the top, and the main SELECT is easy to read.
b. For a reason which hasn't been explained to me, but which I have seen happen with queries I did not hope to ever run, this way of writing the code makes your query very efficient. I have had queries which I could not run at all, they would always error out due to excessive running time, which, when reorganized using 'WITH', ran successfully in under 30 minutes.
c. It is easy to break apart and test. If you are not sure how the results should look, this syntax allows you to spot the error easily. You can test the output of one subselect at a time and search for duplicate rows or output not as you would have expected (e.g. blank columns). It is much easier to find the root cause of your error.
d. It makes building queries on the go quite effortless. For example, in the query above, the 'asn' subselect, whose only purpose is to provide me a list of ASINs, can be adapted to any query. I can almost always rely on that piece of code as the part which gives me the selection I want to run the query for. Since it is already tested, you don't have to test it again when copying it to a new query.
Here are some other examples of subselects I often use which allow me to build a query swiftly:

Subselect for EU5 glance views

,gv as (
SELECT --extracts glance views for the past x days at ASIN and MARKETPLACE level
     MARKETPLACE_ID
    ,asin
    ,sum(nvl(GLANCE_VIEW_COUNT,0)) as GVs
FROM D_DAILY_ASIN_GV_METRICS
WHERE REGION_ID = 2
  AND MARKETPLACE_ID in (3,4,5,35691,44551)
  AND SNAPSHOT_DAY between TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD') - 7
                       and TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
  AND IS_SUPPRESSED_ASIN in ('N')
GROUP BY
     MARKETPLACE_ID
    ,asin )

Subselect for OPS (ordered product sales), units sold and GMS (Gross merchandise sales)

,dem as (
SELECT
     asin
    ,marketplace_id
    ,sum(nvl(NET_ORDERED_UNITS,0)) as demand          --units sold over x period of time
    ,sum(nvl(NET_ORDERED_PRODUCT_SALES,0)) as NET_OPS --net OPS over x period of time
    ,sum(nvl(TOTAL_GMS,0)) as TOTAL_GMS               --GMS over x period of time
FROM D_DAILY_ASIN_ACTIVITY
WHERE REGION_ID = 2
  AND MARKETPLACE_ID in (3,4,5,35691,44551)
  AND MERCHANT_CUSTOMER_ID in (9,10,11,755690533,695831032) --Amazon EU Retail
  AND ACTIVITY_DAY BETWEEN TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD') - 365
                       AND TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')  --x period of time
GROUP BY
     asin
    ,marketplace_id )

Subselect for EU inventory level

,instock as (
SELECT /*+USE_HASH (inv,wh) */
     inv.asin
    --,decode(inv.INVENTORY_OWNER_GROUP_ID,7,3,8,4,9,5,75,35691,85,44551) as Marketplace_id
    ,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 7  then inv.ON_HAND_QUANTITY end,0)),0) as instock_UK
    ,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 8  then inv.ON_HAND_QUANTITY end,0)),0) as instock_DE
    ,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 9  then inv.ON_HAND_QUANTITY end,0)),0) as instock_FR
    ,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 75 then inv.ON_HAND_QUANTITY end,0)),0) as instock_IT
    ,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 85 then inv.ON_HAND_QUANTITY end,0)),0) as instock_ES
FROM D_INVENTORY_LEVEL_BY_OWNER inv
JOIN o_warehouses wh on wh.warehouse_id = inv.warehouse_id
WHERE inv.INVENTORY_OWNER_GROUP_ID in (7,8,9,75,85)
  AND inv.region_id = 2
  AND inv.snapshot_day = TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')
  AND inv.INVENTORY_OWNER_GROUP_ID = decode(wh.organizational_unit_id,2,7,3,8,8,9,29,75,30,85) --joining IOG on ORGANIZATIONAL_UNIT_ID to only pull inventory stowed in the owning MP
GROUP BY
     inv.asin )

Subselect for EU buyable selection for a given day

,dmo as (
SELECT /*+USE_HASH(dmo) */ DISTINCT
     asin
    ,MARKETPLACE_ID
FROM D_DAILY_BUYABLE_OFFER_LISTINGS dmo
WHERE dmo.REGION_ID = 2
  AND dmo.MERCHANT_CUSTOMER_ID in (9,10,11,755690533,695831032)
  AND dmo.MARKETPLACE_ID in (3,4,5,35691,44551)
  AND dmo.SNAPSHOT_DAY = TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'yyyy/mm/dd')
)

3. ADDING AN INTERMEDIARY TABLE TO BRIDGE TO OTHER TABLES


Joining tables is not always an easy thing to do, especially when the common columns are not enough for an efficient join, or any join at all. By being resourceful you can do a great many things which initially seem impossible; just think outside the box.

Let’s look at this example and then break it off into pieces explaining what is happening.

Try not to focus on the new syntax in the SELECT. That's a function called substr (it does exactly what the =LEFT or =RIGHT formulas in Excel do: it gets the number of characters you specify, starting from a point you specify).

I'll start by stating the purpose of this query.
I want to get the minimum vendor cost for a set of ASINs along with all the vendor_codes which provide
an offer for the ASINs and their respective costs. I want to limit my results to Marketplace 3 (UK) vendors
only. I have identified the table which has all vendor codes, ASINs and costs (D_VENDOR_COSTS), but
that table does not have any marketplace_id information. I would have to get that from another source. I
have identified that O_AMAZON_BUSINESS_GROUPS has something I can use, which is the ‘Type’
column.

Looking at the description in the sample data, I see that the TYPE column values always start with the country prefix (e.g. UK, CA, JP, DE etc), so I can use a trick to get the marketplace information from there (the substr function). However, I have no way of joining D_VENDOR_COSTS and O_AMAZON_BUSINESS_GROUPS; there are no common columns.

Which is where the table VENDORS comes in. I use this table as a bridge between D_VENDOR_COSTS and O_AMAZON_BUSINESS_GROUPS.

Notice that I have joined D_VENDOR_COSTS to VENDORS, telling SQL that ven.vendor_pub_code and v.primary_vendor_code store the same information. Then I joined VENDORS to O_AMAZON_BUSINESS_GROUPS, this time with v.amazon_business_group_id storing the same information as bg.id. This is what I wanted to point out: when joining more than 2 tables, you are not limited in your joins to just columns from the main table (in this case, D_VENDOR_COSTS). The second you joined D_VENDOR_COSTS to VENDORS, you created a temporary table containing all the information the two tables have stored for all the vendor codes they have in common. Those two tables now act as one, which is why you can use information from VENDORS when joining on O_AMAZON_BUSINESS_GROUPS.

Essentially, this is a way of bridging two tables which otherwise could not be joined.
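The query itself is a screenshot, so here is a hedged sketch of the bridge (aliases and join columns as described above; the vendor code and cost column names, and the UK prefix filter, are assumptions):

SELECT
     ven.asin
    ,ven.vendor_code                     -- assumption: the vendor code column name
    ,ven.cost                            -- assumption: the cost column name
    ,substr(bg.type, 1, 2) AS mp_prefix  -- first 2 characters of TYPE, e.g. 'UK'
FROM d_vendor_costs ven
JOIN vendors v
  ON v.primary_vendor_code = ven.vendor_pub_code  -- bridge join #1
JOIN o_amazon_business_groups bg
  ON bg.id = v.amazon_business_group_id           -- bridge join #2, using a VENDORS column
WHERE substr(bg.type, 1, 2) = 'UK'                -- limit to Marketplace 3 (UK) vendors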

4. USING SUBQUERIES FOR MULTIPLE CALCULATIONS

Let's look a bit at this query, what it does and why it does it. Again, don't focus on the functions or statements we have not covered yet; they're less important. I needed this query to pull 30% of all YUMA cases each associate handles, per closing reason. So if an associate works 10 cases on reason a, 10 cases on reason b and 10 cases on reason c, I would need the query to pull 3 case_ids for reason a, 3 for reason b and 3 for reason c. At the same time, if the associate in question worked fewer than 3 cases on a reason, I would need all those cases in my extract. This requires an advanced algorithm which is hard/impractical to implement in one query.

So I have a big select containing all the info I want from this query, including the row number for all combinations of user_id + reason. This means that the results from my first layer look like this:
Case id  User_id  Reason  Count1
Aaa23    Id_1     1       1
Aaa24    Id_1     1       2
Aaa25    Id_1     1       3
Aaa26    Id_1     1       4
Aaa27    Id_1     1       5
Aaa28    Id_1     1       6
Aaa29    Id_1     2       1
Bbb21    Id_2     1       1
Bbb22    Id_2     1       2
Bbb23    Id_2     1       3
The second layer extracts everything from the 1st layer and gives me the max row_count per
combination of user + reason. The table would look like this.

Case id  User_id  Reason  Count1  Key
Aaa23    Id_1     1       1       6
Aaa24    Id_1     1       2       6
Aaa25    Id_1     1       3       6
Aaa26    Id_1     1       4       6
Aaa27    Id_1     1       5       6
Aaa28    Id_1     1       6       6
Aaa29    Id_1     2       1       1
Bbb21    Id_2     1       1       3
Bbb22    Id_2     1       2       3
Bbb23    Id_2     1       3       3
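A sketch of how the first two layers could be chained (the source table and the ordering column are assumptions; the third layer's marking logic is quoted below):

WITH layer1 AS (
     SELECT case_id, user_id, reason
           ,ROW_NUMBER() OVER (PARTITION BY user_id, reason ORDER BY case_id) AS count1
     FROM yuma_cases  -- placeholder source table
)
,layer2 AS (
     SELECT l1.*
           ,MAX(count1) OVER (PARTITION BY user_id, reason) AS key
     FROM layer1 l1
)
SELECT case_id, user_id, reason, count1, key
FROM layer2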
The 3rd layer contains my algorithm which has all the data it needs to mark the case_ids I want to audit.
That algorithm is written as a case when statement.

case when IS_REOPENED = 1 then 'Y'
     when RESPONSE_VALUE = 'N' then 'Y'
     when key <= 6 and count1 = 1 then 'Y'
     when key <= 6 and count1 = 2 then 'Y'
     when key >= 7 and count1 < key * 3/10 then 'Y'
     else 'N' end as IS_FOR_AUDIT
This is what the third layer results look like.

Case id  User_id  Reason  Count1  Key  IS_FOR_AUDIT
Aaa23    Id_1     1       1       6    Y
Aaa24    Id_1     1       2       6    Y
Aaa25    Id_1     1       3       6    N
Aaa26    Id_1     1       4       6    N
Aaa27    Id_1     1       5       6    N
Aaa28    Id_1     1       6       6    N
Aaa29    Id_1     2       1       1    Y
Bbb21    Id_2     1       1       3    Y
Bbb22    Id_2     1       2       3    Y
Bbb23    Id_2     1       3       3    N

