Research Paper SQL
I Intro
1. What is SQL
2. What is ETL?
3. SQL vs Excel
4. Writing basic code
5. Creating the first query
6. Scheduling a JOB
7. Types of elements
8. CONCATENATING
9. RENAMING COLUMNS
10. PRACTICE
II Where, order by, tables
1. Tables in DW
2. WHERE clause
3. Comparison operators
4. ORDER BY
5. USING COMMENTS
6. Practice
III Partitions, aggregate functions, group by, DATES
1. PARTITION USAGE
2. DATES
3. AGGREGATE QUERIES
4. PRACTICE
IV Joining tables
1. JOINING TABLES
V Subqueries, wildcards, segments
1. SUBQUERIES
2. SEGMENTS
3. WILDCARDS
4. PRACTICE
VI Publishers, Decode, Case when, Lower, Upper, Partition by, Coalesce
1. PUBLISHERS
2. DECODE
3. CASE WHEN
4. LOWER AND UPPER
5. PARTITION BY and MEDIAN function
6. COALESCE
VII Tips and tricks
1. HOW TO GET FROM A TO B WHEN WRITING A QUERY
2. WITH CLAUSE
3. ADDING AN INTERMEDIARY TABLE TO BRIDGE TO OTHER TABLES
4. USING SUBQUERIES FOR MULTIPLE CALCULATIONS
I Intro
1. What is SQL
SQL (short for Structured Query Language) is a programming language for managing and querying data in databases.
Using SQL you can:
For the moment, the key takeaway is that this training will teach you how to extract
the data you need for reports, analyses and project scoping, and where to find it.
2. What is ETL?
ETL is the integrated platform Amazon uses for SQL. It is more user-friendly than SQL Developer
or other programs used to interact with databases, and it is customized (and customizable!) for
the needs of Amazonians.
ETL is short for Extract, Transform, Load.
a. Extract is a pretty intuitive term, and it refers to the data extraction. It is the part of ETL
which we will use in the following weeks, and the only one this training will focus on.
b. Transform & Load – this part refers to the table creation side of ETL. Transform is the query
you write, which gathers whatever data you select; Load then loads that data into a table you can
access.
3. SQL vs Excel
In many cases SQL is just like Excel. Throughout this training you will see how different
commands have an exact Excel counterpart and do, for the most part, the same thing. As long as
you have a general idea of what the basic commands do, it won't be hard to catch on.
Imagine you have a folder with multiple Excel files (tables). Each of these files would store
different information, organized in columns. Let's imagine you have an Excel file named "Amazon
FCs" (FC = fulfillment center) which has multiple columns like FC name, FC location, FC capacity etc. What do you do if
you want to get that data and put it in a new file, meaning you want to create a new output
containing data relevant to you?
If you want an Excel sheet or file with just the FC id and the FC name, you go into the file called
"Amazon FCs", you SELECT the columns which have the information you need, let's call
them "Warehouse_id" and "Name", FROM the file which holds those columns ("Amazon FCs"),
and paste them into the output you need.
SELECT WAREHOUSE_ID, NAME
FROM D_WAREHOUSES
SELECT & FROM are the two basic requirements of any SQL query. It's pretty logical: if you
want any data at all, you need to tell SQL what you want and from which location to get it,
just as you would tell a real person.
a. Go to https://datanet.amazon.com, tab ETL Manager and click on Data Feed from the menu on
the left (if you don’t have permissions please see this link).
b. Once you click Data Feed, you will get to the Edit Data Feed Profile - New Profile screen, which is
basically an editor and a worksheet for your SQL code.
Remember to always name your query in a manner that will make it easy to find. This may not
seem like a big deal now, but when you reach 90 – 100 queries under your login spanning more
than half a year, it will be difficult to remember what each query does, and reading through them
will prove time-consuming.
c. Write in your code, which in this case will be SELECT WAREHOUSE_ID, NAME FROM
D_WAREHOUSES. Remember the two requirements of any query, the SELECT and the FROM
keywords.
d. Click save on the bottom right corner.
We’ll take a little break from this to explain what you are seeing now, so we’ll step away from
SQL and head into ETL territory (again, ETL is the UI that Amazon has built for SQL).
The Profile section includes the basics of the ETL job – who created it and who modified it last,
and even a history of every change that’s been made (with the option of reverting to an old
version). It also includes the Profile SQL section, at the bottom, which is where you write or
paste in your SQL. But most importantly, it has the Edit Profile button. Every time you need to
change/edit your code you’ll have to press Edit Profile.
The Publisher section is where you indicate what publisher type you want to use (if any), the
format, and the details of where or to whom to publish. A publisher will allow you to
have the data extracted from DW by your query sent to you via email, published on a share drive
or a SharePoint etc. The publisher is very useful when you rely on daily/weekly output from
a query which could also be quite big (>30MB). Using a publisher can save you lots of time; we
will handle the specifics in a later part of the training.
The Job section is where you indicate the time zone for the job, the priority of the job, what
database to query, what user to use to log into that DB, and is where you can define some
wildcards and establish scheduling (which we’ll talk about in future classes).
No profile can run without a job, so you’ll have to create one for every one of your queries. There
is a bonus here: you can add multiple Jobs to the same Profile and, using some other ETL
tricks, bring more functionality to your data extraction.
Going back to your ETL page, you’ll notice you don’t have a JOB for your Profile, so let’s create
that now. In the Job section of the page, click on ‘NEW’.
This will take you to a new page which will require you to fill in some information, so we’ll go
through it one by one. Highlighted below you’ll find the fields you must fill in, along with info on
what to choose. If you omit any of them, don’t worry, Datanet will highlight the
missing information.
a. Description – This is already filled in, but you can change it to something which has more
meaning to you.
b. Time zone – I always use the time zone closest to ours, which is Europe/Paris. You can
choose a time zone from the drop-down list.
c. Group – This part is about who can download the results of your query and who can edit it.
Default is personal, which means that only you can view the results and make changes. I
normally use a group from the drop-down list as I think it’s easier to handle changes, particularly
when I want my code to be vetted/used by someone else. The alternative (in case the
group is set to personal) would be to copy the profile, which is a bit more work and may have
other implications; we’ll review them later on.
d. Priority – believe it or not, you won’t be the only person running a SQL query at any given
moment. There are always tens of thousands (at least!) of queries queued up to run. Define
the priority in relation to the business needs of your query.
e. Partition type – we will talk more about this when we cover the Wildcards. For now just know
that you have to choose a partition (I usually go for Region and choose the value ‘2 – EU’) but
that it will not in any way influence your Profile.
f. Extract Specific Options – Always use the L-DW database. The users which should grant you the most
access are APPS_RO, OPS_RO, RETAIL_RO. And for Extract type, when you are querying for data,
always use Data Feed.
Notifications – this is not mandatory, but if you don’t want to check your query every 5 minutes to
see if it has run, you can have ETL send you a notification. The process is pretty straightforward and
self-explanatory. You can also explore the ‘Create New Notification’ button, which will add a lot of
functionality that, to me, is only suited for high impact queries (Business Critical or Omega Critical
priority level).
6. Scheduling a JOB
I previously briefly mentioned that you can schedule a job to run whenever you want. Normally,
you’d have to start it ad-hoc, but what if you need a new set of data every day/week/month? Then
scheduling is the way to go.
You can schedule the job to run daily, multiple times a week at your choosing, monthly etc. Just
select/tick the boxes which match your planning and you’re done.
Click save in the bottom left corner, and Datanet will return you to the previous page where you’ll
see the Profile, the Job and the profile SQL.
You are now ready to run your first query. Just click the run button in the profile job section and that
will start your query. The query will go through 5 phases, which I will describe below.
So you have clicked ‘Run’ and now you are requested to select a run date for your query.
Remember, DW has a 1 day lag, so you will never have live data. What you will have, with every
query, is a snapshot which is at least 1 day old. This means that, if you choose the current day as run
date, your query will only run once the DW tables have loaded. All information which is live now will
be stored in DW tomorrow. The point is: always choose a run date at least one day in the past.
Now that the query has a run date, it is ready to start assessing your code, the dependencies, the
costs etc and eventually run.
New – The job run is assessing your code, looking at the table names you told it to fetch the
information from, column names etc. If there is a bug in your code, the query will not pass into the
next stages.
Waiting for Dependencies – The job run is looking at the table(s) it needs and connecting to them. If
you are stuck in this stage for more than 5 minutes, it means that one of the tables you need has
not yet loaded for the run date you have chosen, in which case you can either wait for it to load or
delete the Job run and choose an earlier run date.
Waiting for Resources – Your code has been assessed, the dependencies as well and everything
checks out. Right now your job is queued up for running. Whether it runs sooner or later depends on
the size of the DW queue and the priority you have set for your query.
Executing – DW has started fetching the data you requested via your SQL code.
Success/Error – the job run has finished. If it’s successful you can proceed to download the
results, if not…we’ll cover that later.
What else can you have in your SELECT clause?
7. Types of elements
There are multiple types of elements you can extract from DW. It may be a column from a table, or it
may be a fixed value (such as a text string) you need.
Your extract would look something like this. At this point you’re probably wondering about the necessity
of column 'THE WAREHOUSE IS CALLED'.
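As a sketch of what such a query might look like (the text string is a fixed value, not a column, so it repeats on every row of the output):

```sql
-- A real column and a fixed text string, side by side in the same SELECT.
SELECT WAREHOUSE_ID,
       NAME,
       'THE WAREHOUSE IS CALLED'
FROM D_WAREHOUSES
```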
8. CONCATENATING
In SQL, just as in Excel, you can concatenate 2 or more values/ columns together. If in Excel you’d use ‘&’
between values or the =CONCATENATE() formula, in SQL you have two bars to mark a concatenation:
‘||’.
'THE WAREHOUSE IS CALLED '||NAME
THE WAREHOUSE IS CALLED Amazon 倉庫
THE WAREHOUSE IS CALLED 強靱さが
THE WAREHOUSE IS CALLED Mopure Foods - Swanton
THE WAREHOUSE IS CALLED Paragon - Nevada
THE WAREHOUSE IS CALLED Jifram Extrusions, Inc
THE WAREHOUSE IS CALLED Ferguson Ent - Frostproof #761
THE WAREHOUSE IS CALLED XrSize - Shannon MS
THE WAREHOUSE IS CALLED WGF (Crystalcreek) - WA
THE WAREHOUSE IS CALLED [INACTIVE] SAI - Sedalia
Notice that there is a space between the end of the text string and the second quote. Just as in Excel you
have to add spaces when concatenating (e.g. =CONCATENATE("THE WAREHOUSE IS CALLED"," ",NAME)),
the same must be done in SQL if you want spaces in your concatenated value.
That was a toy example, so let me give you a better one, showing how concatenating 2 or more fields can be
really helpful in a report.
Now I have the GL name and code information all in one field.
REGION_ID || '-' ||
MARKETPLACE_ID
1-1
2-3
2-4
2-5
2-35691
2-44551
It’s a neat trick that I can concatenate any fields (literally any fields: columns, text strings, pseudo
columns etc) but from the looks of it, the more fields I concatenate, the longer the column name is going
to be.
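Putting the pieces together, the report column above could be produced by a sketch like this (D_MP_ASINS is the ASIN-level table we will use later in this training; note how the whole expression becomes the column name):

```sql
-- Concatenate two numeric columns with a literal '-' separator.
-- The output column ends up named REGION_ID||'-'||MARKETPLACE_ID,
-- which gets unwieldy fast; the next section fixes that with aliases.
SELECT DISTINCT REGION_ID || '-' || MARKETPLACE_ID
FROM D_MP_ASINS
```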
9. RENAMING COLUMNS
SQL has a feature which allows you to change the name of a column in your extract. This means you
are not limited to column names such as MARKETPLACE_ID or ITEM_NAME. This is a good opportunity to
review the elements you can have in a SELECT clause.
This is how your export will look. While these are not the most intuitive examples of why it is good to use
a column alias, the alias’ relevance will become obvious once we get to more complex code; but
let me give you an example first.
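A minimal sketch of column aliases, using the warehouse table from Chapter I (the alias names are just examples):

```sql
-- AS renames the output columns only; the underlying table is untouched.
SELECT WAREHOUSE_ID AS FC_ID,
       NAME         AS FC_NAME
FROM D_WAREHOUSES
```

In Oracle-style SQL the AS keyword is optional; NAME FC_NAME works just as well.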
This is called a ‘case when’ function; it is the equivalent of an ‘=IF’ formula in Excel, and yes, we will cover
it later on in our training. Notice that this ‘case when’ function is renamed ‘EXCLUSION_REASON’, and
all 12 rows of that function represent one column. Without this column alias, the column name would be the
exact piece of code, all 12 rows of it, which is not only unusable, but tells a person who is not familiar
with this query nothing about the data it pulls in this column.
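To make this concrete, here is a hedged sketch of a much shorter ‘case when’ with an alias (marketplace ids per the WHERE-clause examples later in this training; CASE WHEN itself is covered properly in Session VI):

```sql
-- Without the alias, the output column would be named after
-- the entire CASE WHEN ... END expression.
SELECT ASIN,
       CASE WHEN MARKETPLACE_ID = 44551 THEN 'ES'
            WHEN MARKETPLACE_ID = 35691 THEN 'IT'
            ELSE 'OTHER'
       END AS MARKETPLACE_NAME
FROM D_MP_ASINS
```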
So take this as advice: try to have the end client in mind when you build your queries, and make the data
as easy to understand as possible. It will save both of you a lot of time!
10. PRACTICE
1. Build a query which pulls a list of all organizational unit ids in Amazon. The table you need to
query is o_warehouses. Run the job and see the results.
What you see in your results is that you get the same values over and over. Why is that?
When you ask DW to fetch the organizational unit ids from the o_warehouses table (or fetch any
information from any table, for that matter), it’s going to give you every instance of the organizational unit
id in that table.
Use ‘Distinct’ to make sure your query returns only unique rows. Consider ‘Distinct’ the ‘Remove
duplicates’ you have in Excel. You will use it after ‘SELECT’ and before the first column name.
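A possible answer, assuming the organizational unit id column in o_warehouses is called ORG_UNIT_ID (check BI metadata for the exact column name):

```sql
-- DISTINCT deduplicates the result rows, like Excel's Remove duplicates.
SELECT DISTINCT ORG_UNIT_ID
FROM O_WAREHOUSES
```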
II Where, order by, tables
1. Tables in DW
Now you know the basics of any ETL query, the SELECT and the FROM statements. But the tables used in
the examples in Chapter 1 are not the only ones in DW. How do you know which table(s) to interrogate
in order to get the information you need? Easy: you go to BI metadata, where you can search for the
columns and tables.
Let’s do a short overview of what you’ll find there. There are 5 tabs in BI Metadata (BI stands for Business
Intelligence).
Home is the overview/welcoming page. You’ll notice a search bar at the top where you can search for
everything from columns and tables to BI questions.
Business questions – picture this as an FAQ section where users ask questions regarding the information
in DW tables and other users respond. When you’re stuck on a query, or you’re having problems
understanding why a table is structured the way it is, or you’re not sure what information a table
holds, this is a good place to search for answers.
Tables
This is an overview of all DW tables. There are some things worth pointing out here:
a. Tab ‘Table Owner’. There are 2 kinds of tables – Booker tables (the ones you can use almost all
the time with just the standard permissions) and Virt tables.
- Booker tables are for the most part public. These are tables managed by the BI team, and they
usually have one of 3 prefixes – D_, O_ or A_. D_ tables are the ones you want to use; D stands
for denormalized, which means that the table is compiled from multiple other tables. O_ stands
for original tables. A_ tables are the ones you want to avoid, as they’re generally built for
reporting purposes and may not contain full sets of data.
- Virt tables are private tables. These are usually created to suit the needs of teams in terms of
reporting, or to ease the strain on queries which are too heavy to run (we’ll cover this in the tips
and tricks section towards the end of this training). A good example of data going into a VIRT is
Vendor Central data.
The takeaway is that you should try to use D and O tables as often as possible.
b. Tab ‘Description’ – here you’ll find (at times) a description of the table: who owns it, and what you
should expect to find.
c. Tab ‘Partition Keys’ shows information about how to optimize your code (use of partitions, which
we’ll cover in detail in Session 3).
d. Tab ‘Retention Policy’ tells you for how long information is stored in that table. Think of this as a
lifecycle for data. For a table where the retention policy is 15 months, you will not find
any records older than 15 months in that table. Basically, every day rows of data are deleted
because they have ‘expired’.
e. Filter search box – on the top right – the most important tool, as it is very efficient when browsing
columns especially. You can filter down to a keyword which matches the table name.
Unfortunately, advanced filters do not work in BI metadata; searches here will return exact
matches (e.g. if I search for ASIN I will find all tables with ASIN in the name; if I search for
ASIN_MARKETPLACE I will find all tables which contain this exact string in the name).
My instant data – this is the holy grail of BI metadata. That’s because you can get a 1-row instant sample of the
information in a table using parameters specified by you. This becomes even more useful when you
have to do ASIN-level deep dives.
Now fill in the required fields and voila, you’ll have all the information that table has for that row.
Using the Advanced Search field
This advanced search will save you a lot of time and effort. From the very start you can eliminate Users’
tables (virt) from your results – which we already know we can’t query without special permissions – and
you can choose whether your input in the search box should apply to Table/Column names only or also to
descriptions etc.
Table overview
Ok, so now we know how to search for the tables and columns we need. Let’s explore one table
at a time and see what’s there. To open a table, just search for it as you learned above and click
on it. That will take you to a page just like the one below. For this example I chose a table I use frequently
– D_MP_ASINS.
You’ll see some common information such as the Description and the Partition Information. What’s cool is
that you have a special section called Associated Business Questions. I urge you to read that part (or
browse through it) before using the table; you may find very useful tips there.
Equally useful is ‘Show Columns and Sample Data’. Click on it and it will take you to a view of all the
columns, where you can see the data type (e.g. NUMBER, CHAR etc) and a description; but the real treat is
the Sample Data button at the bottom of the page.
That’s going to show you live output (unless the content is blocked for security reasons) and you can
better understand how to use every particular column.
Types of data
When querying DW you will most commonly find 4 types of data. Later on, when we talk about dates
and using multiple DW tables in a single query, the importance of these data types will become even
clearer.
CHAR (followed by a number – e.g. CHAR(10)) – this data type is of fixed length. It is used in DW for fields
such as ASIN or EAN, where the number of characters will always be the same. This data type can hold both
letters and numbers.
VARCHAR – the same as the CHAR data type, only with no fixed number of characters. This can be
used for almost any data, from customer reviews to customer_ids.
NUMBER – unlike CHAR AND VARCHAR, which use the string format and can hold alpha-numeric values,
NUMBER will only have…well…numbers. You can do calculations on this field, such as
additions/subtractions/averages etc.
DATE/DATE TIME – this data type is used solely for time stamps. Again, it will become more obvious how
this is helpful once we go further into this training.
Exercise
Now that you know all this, find a Booker table from which you can extract the ASIN, the
MARKETPLACE, the REGION and the ITEM NAME. Specify the name of the table you find, the ~ number of
rows it has, and what the table prefix means.
Now build the query to include the above elements, and make sure to only get unique rows. Before
you run it, we’re going to talk about the explain plan.
So, you’re done building your query, you have built a profile for it, and you are ready to run your profile.
What if you have a bug in your query? This might not make sense now, but as your queries get more and
more complex, it will be pretty easy to miss a comma or a keyword, or misspell a column name. If you
click run on a broken piece of code, it will error out within minutes.
Let’s validate the code. Click the Explain and Analyze Query button, and after it, choose the option
Explain and Analyze.
For the sake of this exercise you’ll run the explain plan after you have saved the query. You can also
run the explain plan while you are editing the SQL code, which is the best option, since you can make
changes and correct whatever is wrong without having to save and reopen the SQL Profile code.
Generally, once the Explain Plan has run, this is what you want to see:
‘This statement conforms to all rules’ is the Holy Grail, it means you can run your code as it is. When you
run an explain plan, there are 3 fields you want to keep an eye on (highlighted above), which are:
- Rows – it will give you a rough estimate of how many rows your query will scan to retrieve the
information you need.
- Bytes – shows you the temp space your code is using.
- Cost – an estimate of the strain your code will put on Datanet. Generally I have found that it is linked
to ‘Rows’ and ‘Bytes’, and that it’s best you keep it under 1,000,000.
Please remember that the above indicators are on-the-spot rough estimates of the cost your query will
have and should be treated as such!
So you’ve run your explain plan and it is not the same as the one above. It actually looks
more like this:
‘The Query Analyzer found issues with your query/queries.’ is a clear indicator that something’s wrong.
The number of Rows, the Bytes allocation and the Cost have skyrocketed! Plus you also see that this
code will return > 10 million rows. As it is, even if it does run (which it probably will, since it’s a
pretty simple query), you couldn’t use any of the results.
2. WHERE clause
The WHERE clause is the Excel filter of SQL. If I were to ask you to find me a “red pen” somewhere in the
office area, you’d spend a lot of time looking for pens. By SQL logic you’d return with all the red pens on
the floor, and I’d have to sort through them to find the one I wanted. But if I tell you to get me the red
pen(s) on your colleague’s desk, that would save us both a lot of trouble. WHERE works much the same
way.
So, what can we add to the WHERE clause to reduce the results? The options are rather broad, but most
times you will be limiting your results to specific column values.
Let’s start with the query above, for which we saw the huge costs. Let’s limit the results to the EU region
(2) and the ES marketplace (44551). We can list any number of restricting conditions.
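The filtered query might look like this sketch (region 2 = EU, marketplace 44551 = ES, as above):

```sql
SELECT ASIN,
       MARKETPLACE_ID,
       REGION_ID
FROM D_MP_ASINS
WHERE REGION_ID = 2          -- EU
  AND MARKETPLACE_ID = 44551 -- ES
```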
We still see that the rows returned are > 10 million (it’s understandable – you’re asking for all the ASINs,
retail or 3P, which are in the Amazon catalog in ES), but the costs (rows scanned, bytes used) are a lot
lower. So adding these 2 filters already improved the query.
Remember!!! The multiple conditions you use in your WHERE clause have to be separated by ‘AND’, as
opposed to the items in your SELECT clause, which are separated by a comma – ‘,‘!!!
Think of AND in SQL as the ‘=AND(statement 1, statement2)’ of Excel. You can also use OR, and it too
works exactly like the OR in Excel.
This example is not the best, but it’s just here to illustrate a point.
What it says is: if either the Region is EU OR the Marketplace is
ES, get the information requested in the SELECT clause. It is
an example of a poor filter because it will pull every record in the
EU region, regardless of the Marketplace.
And, just as in Excel, you can combine the 2. Let’s say you want to
pull the records in Region 2 (EU), when the marketplace is either
IT or ES.
=AND(REGION_ID=2, OR(MARKETPLACE_ID=44551, MARKETPLACE_ID=35691)). See how the OR part is
‘nested’ inside the AND part of the formula?
To achieve this in SQL, you need to ‘nest’ the OR statement, just like in Excel. The WHERE clause will look
like this:
What this does is apply 2 conditions in your WHERE clause: the region must be EU, and the
marketplace must be either ES or IT.
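In SQL, the nesting is done with parentheses; a sketch:

```sql
-- The parentheses force the OR to be evaluated first,
-- mirroring the nested OR() inside AND() in Excel.
WHERE REGION_ID = 2                -- EU
  AND (MARKETPLACE_ID = 44551     -- ES
       OR MARKETPLACE_ID = 35691) -- IT
```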
We’ll talk later on about the importance of ‘nesting’ statements.
3. Comparison operators
1. ‘=’
In the examples above I used ‘=’. This can be used both for numeric values (e.g. marketplace_id) and for
char/varchar2 values (e.g. ASIN). The difference is that char/varchar2 values must always be enclosed in
single quotes!
In Excel you can build formulas to apply to certain values you
specify – as detailed above – but you can also apply a formula to
every value except the ones you specifically label:
=AND(REGION_ID=2, MARKETPLACE_ID<>44551). You can also
use ‘<>’ in SQL, or you have another option, ‘!=’.
Write a WHERE clause in which the region_id is 2 and the marketplace_id is not 35691 (IT).
WHERE REGION_ID = 2
AND MARKETPLACE_ID <> 35691
or, equivalently:
WHERE REGION_ID = 2
AND MARKETPLACE_ID != 35691
2. However, for char or varchar2 formats (basically for string formats) it’s better to use LIKE. Again,
for non-numeric values remember to use single quotes, otherwise you’ll get the ‘INVALID
IDENTIFIER’ error.
I used code which gave me the above error; on the right side is the correct code.
What LIKE has over ‘=’ is the ability to match parts of a value as opposed to the entire value. This can be
achieved using ‘pattern matching characters’, the ‘%’ and the ‘_’.
‘%’ replaces any number of characters (including none), while ‘_’ replaces exactly one character;
you can place either of them anywhere in the string you are trying to match.
For both ‘%’ and ‘_’, you can use them multiple times in the same string, e.g. ‘B00_0000__’
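A sketch combining both pattern-matching characters (reusing the pattern from the line above):

```sql
-- '%' matches any run of characters, '_' matches exactly one.
SELECT ASIN
FROM D_MP_ASINS
WHERE ASIN LIKE 'B00_0000__'   -- a 10-character ASIN pattern
   OR ASIN LIKE 'B01%'         -- any ASIN starting with B01
```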
3. IN operator
What if you want to query for all the ASINs which are in region 2, in the marketplaces IT, ES and FR? Can you
do that using ‘OR’? Yes, but it’s a bit like squashing water. Instead you can use the IN operator, and the
benefits will become more obvious once you query for ASIN/subcategory/GL lists.
It’s important to point out that you’ll still need to use single quotes with the IN operator if you query for
a char/varchar2 item list (e.g. ASINs, EANs), like below.
You can also query for records which are not in a list by putting NOT ahead of IN.
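A sketch of IN with a numeric list and NOT IN with a char list (the ASINs below are made-up placeholders); note the single quotes on the char values only:

```sql
SELECT ASIN, MARKETPLACE_ID
FROM D_MP_ASINS
WHERE MARKETPLACE_ID IN (35691, 44551)          -- numeric list: no quotes
  AND ASIN NOT IN ('B000000000', 'B000000001')  -- char list: single quotes (example ASINs)
```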
4. ‘<’, ‘>’ operators
These operators are used just like ‘=’. They can be applied to numeric fields (like marketplace_id)
but also to char/varchar2 fields (like ASIN), where ‘>’ means the next value in alphabetical order will get
picked up.
5. BETWEEN operator
BETWEEN does basically the same job as >= combined with <=, but with the added value that you don't need to write the column name twice. It will also become a lot more obvious why this is an important operator later on, when you work with dates.
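The equivalence can be sketched as (marketplace IDs from the EU list):

```sql
-- Same rows both ways; BETWEEN is inclusive and names the column only once
SELECT asin FROM d_mp_asins WHERE marketplace_id >= 3 AND marketplace_id <= 5;
SELECT asin FROM d_mp_asins WHERE marketplace_id BETWEEN 3 AND 5;
```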
4. ORDER BY
ORDER BY is to SQL what SORT is to Excel. From a syntax
standpoint, ORDER BY is placed after WHERE!!!
If you want the results to be ordered from Z to A, just add DESC after the ordering criterion.
You know how in Excel you have the advanced sort, where you give more than one criterion for sorting your data? Of course you have this here as well, only it's easier to use in Excel.
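A multi-criteria sort can be sketched as:

```sql
SELECT gl_product_group, asin
FROM   d_mp_asins
WHERE  region_id = 2
ORDER BY gl_product_group,   -- first criterion, smallest to largest
         asin DESC;          -- tie-breaker, Z to A
```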
Later on, when we’ll talk about the DECODE statement, you’ll see how to sort by custom values you
choose.
5. USING COMMENTS
Right now you know how to write basic code. The code you have so far is easy to read; there is little doubt about what putting MARKETPLACE_ID, ASIN and REGION_ID in the SELECT might mean. But, I stress this again, your queries will grow in complexity. If you review a complex query 2 or 3 months after you wrote it, it will take you some time to figure out what everything does, what the logic is, why it is written the way it is. This is what comments are for: you can add comments throughout your query to call out the purpose of different bits of information.
The comments are not considered by SQL when running your code, they’re there only to help you.
There are 2 ways of commenting in SQL.
a. Adding '--' in front of your comment. SQL will disregard all information after '--' as long as it is on the same row – check the comment in the WHERE clause after the Marketplace_id filter.
b. Enclosing the part you want to comment between /* and */. This is useful when you have a bigger portion of code to comment – check the last 3 rows of the query.
But there is another use for the comments. Look at the SELECT clause, what do you notice?
I used comments to eliminate the gl_product_group column from my SELECT clause without deleting it. I did this because I believed it was important for me to remember that I also used to pull the gl_product_group column.
Basically, using comments can also act as a method of preserving previous iterations of your code.
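Both comment styles, including the preserved-column trick, can be sketched as:

```sql
SELECT marketplace_id,
       asin
       --,gl_product_group       -- excluded for now, kept for a later iteration
FROM   d_mp_asins
WHERE  marketplace_id = 4        -- 4 = DE
/* a previous iteration also filtered on region:
   AND region_id = 2 */
```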
6. Practice
1. Find a table which has the ASIN, subcategory information, both subcategory name and
subcategory code, the marketplace, the region, the GL code. Make sure it’s a BOOKER table.
2. Write a SQL query to extract the fields above only for Germany (in the Marketplace_id column description you'll find a wiki with all the MARKETPLACE_IDs and the country each represents).
3. Only pull unique rows.
4. Limit the results to GLs 60, 75 and 79, and write this filter in your query in 2 different ways, using
different relational operators, and in one of the cases use ‘OR’.
5. Rename 3 of the columns.
6. Add a comment to the GL filter in the WHERE clause and mention the GL name for every
GL_code you used (useful wiki with all GL codes and names here).
7. Order the results by GL and MARKETPLACE from smallest to largest and Subcategory code from
largest to smallest.
8. Validate the SQL plan.
9. Run your query to pull results only for the following ASINs: B00TFGWAA8, B00UNA1O0W, B00TFLQ2V6, B00ZPEAFXI.
10. Run an explain plan and run your query.
11. Explain the blank rows outcome.
Useful information
III Partitions, aggregate functions, group by, DATES
Recap exercises
1. PARTITION USAGE
We talked about filters and how to use them, and how to use relational operators to limit the results. Your queries will reach a level of complexity where you'll want them written as efficiently as possible. Limiting the results to only the field values you need may not be enough; you will need to take partitions into consideration.
Range partitioning makes queries against large tables much more efficient. A table that is range partitioned by one or more columns is virtually (and even physically, via disks) split into chunks called partitions – one for each distinct value or range of values in that table. For example, if the table D_MP_ASINS were partitioned by the column REGION_ID, it would be virtually split into one partition per region – one that includes all the records for REGION 1, one for REGION 2, one for REGION 3 and so on. A query against the table with REGION_ID = 3 in its WHERE clause would look only at the portion of records associated with REGION 3, and not bother scanning the other partition chunks – thus making it at least 3 times more efficient. If a table is not partitioned, each row of the table is checked against the conditions in the WHERE clause, one by one. If a table is partitioned, only the rows in the partitions that match the WHERE clause conditions on the partitioned columns are scanned, thus saving time and resources.
Think of the DW table as a normal Excel table. Now imagine that for every partition you see in a DW
table, you have a separate sheet in your Excel file.
Remember, you can see the partition information in BI metadata, at table details.
For now, this is enough knowledge about partitions, but we’ll talk about them more in the coming weeks.
2. DATES
While working with Data Warehouse tables, you’ll find two types of DATE columns: DATE columns that
are truncated to only the Month, Day, and Year information (e.g. 12/31/2008), and DATE columns that
also contain the Hour, Minute, and Seconds (e.g. 12/31/2008 08:13:52) – known as the DATETIME
format.
I have never found an instance where I needed DATE information down to the second; however, I have seen tables where the only time reference column was a DATETIME. DATETIME requires a special syntax, which we'll cover later in this chapter, so for now we will focus on the DATE column type. Why is the DATE data type so important, and what exactly can you use it for?
A timeframe gives meaning to any report. If I want to analyze the sales product 'X' had during peak season (e.g. 2 weeks before Christmas) relative to the rest of the year, I will need timeframe information for the records I pull from DW.
How do you identify DATE type columns in DW? The same way you check column information and table
contents, you use BI-metadata information.
And in sample data you can see exactly what the column contents look like.
In the case of Activity_day, the information stops at day level, with no further details on hour/ min/
second. That’s because the format is DATE and not DATE TIME.
What exactly can you do with the DATE besides just referencing it? We'll turn to data conversion for these examples.
When writing SQL queries, you may find you want to control the format of a date column in your results, so you always know what format it will be in and there is never any question of exactly what the date means. To do this, we use the TO_CHAR() function, which converts the DATE to a character string in a format specified by you.
Let's pay a little attention to the syntax used. Notice that I pulled the column "ACTIVITY_DAY" and then derived several conversions/bits of information from it, converting the data type from DATE to CHAR. Whenever you convert one data type to another, always reference the new data type at the beginning of the row. After you mention TO_CHAR, you enclose in parentheses the name of the column you are converting, followed by a comma, and then, within single quotes, the format you are converting the data into.
See column ‘ACTIVITY_DAY’ for the format the datanet table I have queried uses to store the DATE
information. It goes day, month (as abbreviation) and then year (as the last 2 digits of the year). In this
format, we can’t really be sure what the year is, could be 2016, could be 1916 etc. Using TO_CHAR (see
the SELECT above) I have made the date easier to read (,TO_CHAR(ACTIVITY_DAY,'MM/DD-YYYY')), I have
pulled the number of the day (,TO_CHAR(ACTIVITY_DAY,'D')) and what the day actually was
(,TO_CHAR(ACTIVITY_DAY,'DAY')).
This is pretty neat but there were more elements in the above SELECT, which allow for even more
customization of the DATE.
I have the year written in letters (,TO_CHAR(ACTIVITY_DAY,'YEAR')), the year written in letters and with a
numeric reference (,TO_CHAR(ACTIVITY_DAY,'YEAR-YYYY')) and more complex construct at which we’ll
look separately.
What I have done here is add bits of string to the format. Notice that the syntax is almost the same: start with TO_CHAR, and the format you want is still enclosed in single quotes, but the string bits, which are not recognized DATE elements (as D, YEAR, YYYY, MM, MONTH etc. are), have to be enclosed in double quotes. Sure, it looks like overkill to go into such detail, but further along, a trick like this will spare you some extra Excel work processing the DW output.
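The conversions discussed above can be sketched together; the double-quoted "Day"/"of" literals are my own illustration, and the table name is a stand-in for the datanet table in the screenshots:

```sql
SELECT activity_day,                                          -- raw DATE column
       TO_CHAR(activity_day, 'MM/DD-YYYY')          AS easy_to_read,
       TO_CHAR(activity_day, 'D')                   AS day_number,
       TO_CHAR(activity_day, 'DAY')                 AS day_name,
       TO_CHAR(activity_day, 'YEAR-YYYY')           AS year_in_letters,
       TO_CHAR(activity_day, '"Day" DD "of" MONTH') AS labeled  -- string bits in double quotes
FROM   d_daily_asin_activity;
```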
Why is it worth mentioning that after using TRUNC we’d still end
up with a DATE format? Because you can do calculations on a
DATE format, compare dates, use comparison operators.
The syntax is the same as with TO_CHAR, so let’s go through what each TRUNC function does.
- DDD will strip the time portion and give you the day from which the data was taken.
- D will give you the first day of the week from which the data was taken.
- Y will give you the first day of the year.
- MM will give you the first day of the month.
- Q, the first day of the quarter.
- CC, the first day of the century.
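The TRUNC formats above can be sketched against SYSDATE (each result is still a DATE, so it can be compared and used in calculations):

```sql
SELECT TRUNC(SYSDATE, 'DDD') AS this_day,       -- time stripped, day kept
       TRUNC(SYSDATE, 'D')   AS week_start,     -- first day of the week
       TRUNC(SYSDATE, 'MM')  AS month_start,    -- first day of the month
       TRUNC(SYSDATE, 'Q')   AS quarter_start,  -- first day of the quarter
       TRUNC(SYSDATE, 'Y')   AS year_start,     -- first day of the year
       TRUNC(SYSDATE, 'CC')  AS century_start   -- first day of the century
FROM   dual;
```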
What my filter tells DW is to basically look for all records associated with the 2nd day of the 13th month,
of course it will error out.
Now we're getting back to the beginning of this chapter, where we needed to figure out what sales an ASIN had 2 weeks before Christmas. We're going to go with OPS (ordered product sales) – we'll cover economic indicators later on, when we get to the project/initiatives monetization parts.
The syntax is pretty self-explanatory: you're giving SQL the outer limits of the time interval. Note that BETWEEN expects the lower border first – 'x BETWEEN a AND b' translates to 'x >= a AND x <= b', so writing the border values in reverse order will return no rows.
BETWEEN comes in handy when you have to query a DATETIME column. If the date field you put in your WHERE clause is a DATETIME (e.g. 2016/02/17 06:29:13), using ACTIVITY_DAY = TO_DATE('2016/02/13','yyyy/mm/dd') will return very few or no rows. Essentially, you would be asking DW to get all the results associated with exactly 2016/02/13 00:00:00; the probability of a DATETIME value landing precisely on that second is very low, so you'll miss most rows.
a. Instead of using a one-day reference you can use an interval (e.g. ACTIVITY_DAY BETWEEN TO_DATE('2016/02/13','yyyy/mm/dd') AND TO_DATE('2016/02/14','yyyy/mm/dd')).
b. TRUNC the DATETIME down to DATE level (e.g. TRUNC(ACTIVITY_DAY) = TO_DATE('2016/02/13','yyyy/mm/dd')).
Both options work, but option a is the better one for large tables – it's a lot more efficient. The reason is that with BETWEEN the query can limit the scan to just the timeframe you mention, whereas with TRUNC the function must be applied to every row of the table, which is very costly for the system.
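The two options can be sketched as WHERE-clause fragments (ACTIVITY_DAY and the date are the ones used above):

```sql
-- Option a: the filter is applied directly to the column, so the
-- partition/index on ACTIVITY_DAY can be used to limit the scan
WHERE activity_day BETWEEN TO_DATE('2016/02/13', 'yyyy/mm/dd')
                       AND TO_DATE('2016/02/14', 'yyyy/mm/dd')

-- Option b: TRUNC() wraps the column, so it must be evaluated
-- for every row of the table before the comparison can happen
WHERE TRUNC(activity_day) = TO_DATE('2016/02/13', 'yyyy/mm/dd')
```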
You can use additions and subtractions with dates, either on their own or combined with BETWEEN.
OR
The ‘<’ or ‘>’ is the easiest comparison operator to use when you’re looking for data from a certain
point in the past till today.
ROUND( date , format ) – used to round a date up or down to the nearest day, month, year, etc.
ADD_MONTHS( date , number of months) – used to add (or subtract) months from a date
LAST_DAY( date) – used to determine the last day of the month the date falls in
NEXT_DAY( date , weekday ) – used to find the date of the first occurrence of the specified weekday after the date specified
MONTHS_BETWEEN( later date, earlier date) – used to determine how many months are between two
dates
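A quick sketch of the five functions, run against Oracle's DUAL dummy table (the dates are illustrative):

```sql
SELECT ROUND(SYSDATE, 'MM')                       AS nearest_month_start, -- rounds up past mid-month
       ADD_MONTHS(SYSDATE, -3)                    AS three_months_ago,    -- a negative count subtracts
       LAST_DAY(SYSDATE)                          AS current_month_end,
       NEXT_DAY(SYSDATE, 'MONDAY')                AS next_monday,         -- weekday name is language-dependent
       MONTHS_BETWEEN(SYSDATE, DATE '2016-01-01') AS months_since_2016    -- later date first
FROM   dual;
```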
The results look like this.
If you look at the rows highlighted in yellow, you'll see the same row 4 times. How can I collapse those 4 rows into just one? Imagine you query for thousands of ASINs at a time – an extra 4 rows per ASIN will make it hard to process the data in Excel.
We’ve already covered DISTINCT, which would do the trick here, but what if I want all the sales in a given
day for one ASIN?
3. AGGREGATE QUERIES
Aggregate means grouping multiple rows into just one row.
You have already created aggregate queries when you used the DISTINCT keyword. By using DISTINCT, you grouped multiple identical rows into just one, making the results easier to use.
MEDIAN – which gives you the middle value from an ordered list of values – we’ll cover this later as it
takes a special syntax
Aggregate functions are placed in the SELECT part of the query. The syntax is as follows: the function name (e.g. COUNT) followed, inside parentheses, by the field you want the aggregate function applied to (region_id). In this case it would look like this: count(region_id).
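Put together, the first aggregate functions look like this (the table, ASIN and date are the ones from the example discussed next):

```sql
SELECT COUNT(marketplace_id) AS mp_count,
       SUM(marketplace_id)   AS mp_sum,
       MIN(marketplace_id)   AS mp_min,
       MAX(marketplace_id)   AS mp_max,
       AVG(marketplace_id)   AS mp_avg
FROM   d_daily_asin_activity
WHERE  region_id = 2
  AND  asin = 'B003Y3M4N6'
  AND  activity_day = TO_DATE('2015/12/11', 'yyyy/mm/dd');
```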
Let’s take the first 5 aggregate functions and give them a run.
Above you can see the results. The SUM of the MARKETPLACE_IDs comprises 26 values; this information is given to us by the COUNT. We see the MIN and MAX values, but we already knew them, since the MARKETPLACE_IDs in EU are UK – 3, DE – 4, FR – 5, IT – 35691 and ES – 44551. You'll notice that, given all these, the MAX is not 44551. Why is that?
This means that for the ASIN in the WHERE clause there is no record for the ES MARKETPLACE. That ASIN
does not exist in this table (which is not necessarily an indicator of whether the ASIN is released or not in
ES, or if it has a retail or 3P offer).
What is the name of this table? What information do you think it stores? The table name helps us out a
lot, it’s called D_DAILY_ASIN_ACTIVITY. So this table stores ACTIVITY information at DAILY level. Based on
this, can you say if there is an offer (of any kind) for that ASIN in ES? We can’t deduce that from the table
results, all we know is that for the ACTIVITY_DAY queried (11th December 2015) there was no ACTIVITY
(shipments).
And why, if there are only 5 MARKETPLACE_IDs in EU, is the COUNT 26?
These functions do not pull distinct values; they consider every row matching the conditions in the WHERE clause. In this case, we know there are 26 rows referencing the ASIN B003Y3M4N6 in D_DAILY_ASIN_ACTIVITY in REGION 2 – EU on December 11 2015.
What if I don’t need this information, and I want these aggregate functions done only on DISTINCT
values?
You can use DISTINCT within a function and it will do the same thing DISTINCT does at the beginning of the SELECT clause: it will only consider DISTINCT values. This is the syntax, and it works the same way as when you use it to pull only DISTINCT rows.
Let's compare the results from the SELECT statement which did not use DISTINCT on the aggregate functions with the results from the code on my left.
Without DISTINCT
With DISTINCT
Notice that the COUNT and the SUM are different values, but the AVERAGE is the same. That's because I did not use DISTINCT on AVG (not because SQL restricts it). The point is that using DISTINCT in one aggregate function will not automatically extend to the rest of your functions.
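As a sketch, DISTINCT is applied per function, which is why the plain AVG below stays unchanged:

```sql
SELECT COUNT(DISTINCT marketplace_id) AS distinct_count,
       SUM(DISTINCT marketplace_id)   AS distinct_sum,
       AVG(marketplace_id)            AS plain_avg   -- no DISTINCT here
FROM   d_daily_asin_activity
WHERE  region_id = 2
  AND  asin = 'B003Y3M4N6'
  AND  activity_day = TO_DATE('2015/12/11', 'yyyy/mm/dd');
```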
As I have said before, we can use COUNT, MIN and MAX on non-numeric data types.
GROUP BY
Using the SELECT above and looking at the results, I only get the general picture. I know I have 15 ASINs which appear in the D_DAILY_ASIN_ACTIVITY table a combined 582 times for December 11 2015, but I don't know how many times each ASIN appears. Plus, using the MARKETPLACE_ID is not the best example in terms of relevance.
Let’s take the same SELECT again, once using aggregate functions and once without. We’ll turn to a
finance field once more to emphasize the relevance of aggregate functions.
This is the classic SELECT and you already know what the output looks like. You have to use an Excel pivot table to make use of the results – a lot of work. Plus, the results are going to be huge (I stopped at the 8th row but it went on and on).
You are basically telling SQL to sum up the OPS (ordered product sales) data, but you are not giving it a criterion to sum by. Aggregate functions work the same way as an Excel pivot: if in the values section of the pivot you have the sum of OPS and nothing else in the entire pivot, you will see one sum of all OPS without any other indicator, just like in the examples with the MARKETPLACE_IDs.
More importantly, what the pivot just did was to group the OPS data by a criterion: the ASIN.
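The pivot analogy can be sketched in SQL (the OPS column name is an assumption based on the screenshots; the table is the one used earlier):

```sql
-- One row per ASIN with its total OPS, like a pivot with
-- ASIN in the rows area and Sum of OPS in the values area
SELECT asin,
       SUM(ops) AS total_ops   -- 'ops' column name assumed
FROM   d_daily_asin_activity
WHERE  activity_day = TO_DATE('2015/12/11', 'yyyy/mm/dd')
GROUP BY asin;
```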
HAVING CLAUSE
Once you begin aggregating, you’ll find that you may want to limit your results to records where the
result of an aggregation meets a certain criteria. For example, we might only want to look at OPS for
ASINs with sales in more than one Marketplace. We can’t do this in the WHERE clause, because the
conditions in the WHERE clause are evaluated before we aggregate.
Syntax-wise, HAVING is placed after GROUP BY! Going back to the full list of 15 ASINs, let's assume I am interested in the underperforming products. I already know that it is possible for one product to account for over 6k€ OPS – I have this info above – but I know there are other ASINs which do not sell as well. I want all ASINs which, for that date, are responsible for less than 300 €. (This threshold is not an indicator of any kind; it's a random value to demonstrate a point.)
Just as in the SELECT clause, multiple elements in the GROUP BY clause are separated by commas, not by AND as in the WHERE clause.
This is what my table looks like – exactly like an Excel pivot table. Just as in the WHERE clause, multiple filters in HAVING are separated by AND. It's worth pointing out that you don't necessarily need an aggregate function in the SELECT clause in order to reference one in the HAVING clause, just as you don't need a column in your SELECT clause in order to use it in the WHERE clause.
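The underperformer filter described above can be sketched like this (the OPS column name is assumed; the threshold is from the text):

```sql
SELECT asin,
       SUM(ops) AS total_ops
FROM   d_daily_asin_activity
WHERE  activity_day = TO_DATE('2015/12/11', 'yyyy/mm/dd')
GROUP BY asin
HAVING SUM(ops) < 300;   -- conditions on aggregates go in HAVING, not WHERE
```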
And of course my results are now limited and I get information for just one ASIN out of two.
4. PRACTICE
1. Find a table (other than D_DAILY_ASIN_ACTIVITY) which contains a financial indicator (e.g. OPS, revenue, contribution profit, GMS etc.).
2. Find the partition details and also find the relevant DATE column. Select any 4-day interval to query.
3. Delimit the 4-day interval you want to query in 3 ways, using 3 different comparison operators.
4. Pull the current day of the week (both number and name), the week number and the first day of the week alongside the DATE column.
5. For this set of ASINs (B0088PUEPK, B003Y3M4N6, B0093RMTY6, B00FOKN7D8, B00G7LQA1E) pull the ASIN, the MARKETPLACE (or similar information in the table you chose), the day the metrics were captured, and use whichever aggregate functions you find relevant.
6. Limit the results to ASINs which have information for the given time frame in at least 4 MPs.
IV Joining tables
12. What new CLAUSE do I have to use when I use aggregate functions?
13. Why do we need to use GROUP BY with aggregate functions?
14. How do I limit the results of a query using an aggregate function (e.g. sum(field) > 20)?
15. What is the order of the 6 clauses we have used so far?
16. Give a column example for each data type.
17. Use 3 comparison operators in an example.
18. What is the retention policy of a table?
19. What do you need in order to run an already written query?
20. Name all the comparison operators.
1. JOINING TABLES
By now you have learned how to write basic code, how to browse for table and column information, how to use that information, how to limit your results to just the ones you need, how to select a time frame for your data, how to make the most of dates and how to use aggregate functions.
So far we have been limited to information stored in only one table. Tables can hold over 100 distinct
columns, which is a lot, but given everything that is happening in Amazon and all the specific reporting
needs, even in 100 columns you will not find all the information you need.
What is the solution? You could query one table at a time, get part of the information you need from one query and part from another, and then compile that information in Excel, but that takes extra time and effort. The way SQL answers this problem is by giving you the ability to join 2 or more tables.
To be more precise, let’s say I need to know the SUBCATEGORY for a set of ASINs, among other
attributes. I have found that D_MP_ASINS is the table which stores all the information I need, or almost
all. By querying D_MP_ASINS, I can get the subcategory_code, but that will not tell me too much about
what that subcategory holds. I would need a subcategory description/name, but that field is not in
D_MP_ASINS. I have an abundance of tables from which to choose.
From experience, I will choose D_DAILY_ASIN_GV_METRICS, where I know I have the column
“SUBCATEGORY_DESC” as well as “SUBCATEGORY_CODE”.
We need this output: ASIN, item_name, GL, subcategory and subcategory description.
There is a lot of new information in here, so let's try and split this query up into chunks. Starting with the SELECT, you will notice that every column I pull has a prefix, and that same prefix is placed after the table name in the FROM clause. When you join 2 tables, SQL has no way of knowing from which table it should take the information in your SELECT clause. What this query says is: get the ASIN, the ITEM_NAME, the GL_PRODUCT_GROUP and the SUBCATEGORY_CODE from the table D_MP_ASINS (notice that the column prefix is 'dmp' and the table is renamed, or aliased, dmp), and the SUBCATEGORY_DESC from D_DAILY_ASIN_GV_METRICS (with the column prefix being 'gv' and the table being renamed the same way).
'dmp' and 'gv' are called TABLE ALIASES. So basically, what you are doing is renaming the table. This is not mandatory – you can use the full table name as a prefix (any way of telling SQL which table to use for each column works), and it would look like this:
It's not as practical as using table aliases, but it will do.
Now that I have explained the TABLE ALIAS, let’s move to the FROM
clause. After ‘From d_mp_asins dmp’ you see another table
referenced, with the keyword ‘join’ placed in front of it. This is the
syntax for using multiple tables in the same query, you must
reference all tables. There is a main table (d_mp_asins) and a
secondary table (d_daily_asin_gv_metrics). We'll cover the 'join' keyword a bit later; for now, let's continue with the code.
After mentioning the 2 tables you have to specify to SQL how to join them and what the common ground
is. You are asking SQL to get you info from 2 tables for the same set of data. But SQL does not know that
it’s the same set of data. It does not know that the ASIN column in D_MP_ASINS has the same
information as the ASIN column in D_DAILY_ASIN_GV_METRICS, so you have to specify it.
Whenever you join multiple tables, you have to use the partitions each table has. How are the 2 tables
partitioned? D_MP_ASINS is partitioned by MARKETPLACE_ID and REGION_ID,
D_DAILY_ASIN_GV_METRICS is partitioned by REGION_ID and SNAPSHOT_DAY (This is information you
can easily find in BI-Metadata as explained in chapter II).
Above you can see how all partitions are used. The 2 tables are not joined on region_id – is this partition still taken into consideration? While the tables are not joined on region_id directly, the filter is implicit through the marketplace_id join: marketplace_id 3 is one of the 5 marketplace_ids in the EU region (region_id = 2), and marketplace_id is used in the WHERE clause.
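Putting the pieces together, a sketch of the join with aliases and all partition columns in play (column names follow the screenshots described above):

```sql
SELECT dmp.asin,
       dmp.item_name,
       dmp.gl_product_group,
       dmp.subcategory_code,
       gv.subcategory_desc
FROM   d_mp_asins dmp
JOIN   d_daily_asin_gv_metrics gv
  ON   dmp.asin = gv.asin
 AND   dmp.marketplace_id = gv.marketplace_id
WHERE  dmp.marketplace_id = 3                                  -- UK
  AND  dmp.region_id = 2                                       -- D_MP_ASINS partitions
  AND  gv.region_id = 2                                        -- gv partition column 1
  AND  gv.snapshot_day = TO_DATE('2015/12/11', 'yyyy/mm/dd');  -- gv partition column 2
```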
JOIN SYNTAX
The syntax described above is what you'll see in most SQL queries. However, it is the newer syntax; before it, the tables were separated by commas in the FROM clause, while the join criteria (the columns on which the tables were joined) were mentioned in the WHERE clause.
JOIN TYPES
There are more ways of joining two tables, each with its own advantages and each being more suited to
particular queries.
The join used in the examples above is an inner join (you might even see 'inner join' in some queries instead of just 'join', but it's the same thing).
This is a Venn–Euler diagram. Ellipse A is table D_MP_ASINS. Ellipse B is table D_DAILY_ASIN_GV_METRICS. What the INNER JOIN does is take only what is common to both tables and place it in your output.
In our query, we are getting the ASIN, item name, subcategory_code and GL from D_MP_ASINS and the
subcategory_desc from D_DAILY_ASIN_GV_METRICS. Let’s say that in D_MP_ASINS, within the
subcategory ‘14700720’ I have the following ASINS: B000000000, B000000001, B000000002,
B000000003 and B000000004. However, in D_DAILY_ASIN_GV_METRICS, the ASINs present in the table
for the SNAPSHOT_DAY I used are B000000000, B000000001, B000000002. (I want to emphasize again the specifics of each table: D_DAILY_ASIN_GV_METRICS will, by the purpose for which it was created, only store the ASINs which have glance views on a given date.)
ASIN Item_name Gl_product_group Subcategory_code Subcategory_desc
B000000000 Item name 1 147 14700720 Internal hard drives
b. OUTER JOINS
The INNER JOIN takes whatever is in the middle of the two tables, the green section above. The outer joins take the other sections. What the outer joins are good for is that they will not limit the results of the query to only what is common between the queried tables.
- LEFT JOIN
Going back to the diagram, the left join will get all the results in A (D_MP_ASINS) + whatever matches
from B (D_DAILY_ASIN_GV_METRICS). In our case, it means I will see even the ASINs which are not
present in D_DAILY_ASIN_GV_METRICS (they have no glance views for the chosen SNAPSHOT_DAY).
Notice that for ASINs B000000003 and B000000004, there is no Subcategory_desc information. That is
because, as we have previously seen, those 2 ASINs do not appear in D_DAILY_ASIN_GV_METRICS for the
queried SNAPSHOT_DAY.
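The LEFT JOIN variant of the query keeps every D_MP_ASINS row, sketched below:

```sql
SELECT dmp.asin,
       dmp.subcategory_code,
       gv.subcategory_desc    -- NULL for ASINs with no glance views that day
FROM   d_mp_asins dmp
LEFT JOIN d_daily_asin_gv_metrics gv
  ON  dmp.asin = gv.asin
 AND  dmp.marketplace_id = gv.marketplace_id;
```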
- RIGHT JOIN
Being an outer join, the RIGHT JOIN does the same thing as the LEFT JOIN, except that the main set of data is pulled from the table you RIGHT JOIN on. It's basically a reversed LEFT JOIN. To illustrate this, let's say that apart from the 3 ASINs common to D_MP_ASINS and D_DAILY_ASIN_GV_METRICS (B000000000, B000000001, B000000002), the latter table also has ASINs B000000007, B000000008, B000000009, for which there is no record in D_MP_ASINS.
The syntax is the same as for the LEFT JOIN, just write ‘RIGHT JOIN’ instead.
Given the data above, this is what our export would look like.
These results look exactly like the INNER JOIN results. Why is that?
Let’s analyze the syntax closely. You are asking for the subcategory_code from the D_MP_ASINS
(dmp.subcategory_code). Regardless of the join type, you are limiting the results of this query to
whatever records are in D_MP_ASINS. How can this be avoided? You either SELECT the data FROM
D_DAILY_ASIN_GV_METRICS first and then LEFT JOIN it to D_MP_ASINS, or, better yet, you keep the
same RIGHT JOIN condition you have now, but you make the Subcategory_code one of the join criteria.
This is a good example of what joining on the right columns can mean to your query results.
There is a third type of outer join, the FULL OUTER JOIN. Think of this join as the opposite of the INNER JOIN: where the INNER JOIN extracts information only for the common ASINs (in our case), the FULL OUTER JOIN extracts everything from the two (or more) tables. If I full outer joined the above tables on the ASIN column, the result would be an extract containing all ASINs in table d_mp_asins plus all ASINs in table d_daily_asin_gv_metrics. This type of join is an extreme measure; I have not yet found an issue that required it, but it might be useful one day.
Always be very careful of what you add to your WHERE clause because it might turn the OUTER JOIN into
a full INNER JOIN!!!
I am still relying on D_MP_ASINS to provide the main bits of information in this query.
As your data needs become more complex you will find yourself in the situation where not even 2 or 3
tables contain the information you need. Even though in the example below, joining 4 tables together
looks easy, in fact you may want to pay special attention to how each table is built, how it is partitioned,
what columns are nullable etc.
1. SUBQUERIES
Subqueries are SQL statements nested inside other SQL statements – a query within a query. Think of a subquery as a temp table you are creating, one whose results are discarded as soon as your query is done running. When we take a look at how to optimize a query, you'll see a very efficient way of using subqueries which will do wonders for the runtime of your SQL code.
Subqueries can be used in the FROM clause (which I recommend the most) as a table to JOIN on, or in the WHERE clause to limit your results (which is very cost-inefficient).
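A sketch of a FROM-clause subquery (the flag column name is from the text; its 'Y' value and the exact filter are assumptions):

```sql
SELECT gv.asin,
       gv.marketplace_id,
       gv.subcategory_desc
FROM   d_daily_asin_gv_metrics gv
JOIN  (SELECT DISTINCT asin             -- temp "table", discarded after the run
       FROM   d_mp_asins
       WHERE  region_id = 2
         AND  is_very_high_value = 'Y'  -- flag value format assumed
      ) flagged
  ON   gv.asin = flagged.asin;
```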
The way the query on the left is built, it gets a list of all ASINs in EU for which this flag is set (regardless of MARKETPLACE info, just an ASIN list) and then it gets the GVs, MP and GL info for each and every one. This means that if ASIN B000000000 has the "IS_VERY_HIGH_VALUE" flag set to YES in DE and FR, but not in UK/IT/ES, I am still getting information for those locales as well. By using the code on the right I limit my results to just the cases in which the flag is set at MP level.
The relevance of this example is that I know the value for this attribute should be the same in all EU countries, but I also know there are data quality issues with DW; this is a hack to get around one such possible IDQ issue.
As I have mentioned, I can use that subquery to limit my results by placing the subquery above in the
WHERE clause.
A subquery runs basically like a query, but it stores its results in a temp space, creating a new table in the
process which will be discarded once the query is done running.
It is a lot more efficient to use subqueries when you need data from multiple tables. Subqueries are the first step towards using re-usable code in your queries. Let's face it: with most queries you build, you are not going to re-invent the wheel. More so, you may find a common denominator between all the queries you build, meaning there is a block of information you will always pull. In my case, almost every query I build has glance view information, because that is a good indicator of how popular an ASIN is, and it also allows me to prioritize which ASINs to fix in case there is an issue somewhere, or which ASIN to deep dive on in case of a wider problem.
Now, whenever I need glance view information in my query, I just have to copy and paste a subquery
instead of writing all that code in my query and figuring out on which tables to join, how to join etc.
2. SEGMENTS
Up until now we have always queried either for a small group of ASINs, limiting our results to only the ASINs in the WHERE clause, or without giving the query any particular ASIN, instead giving it a general field to search on, such as all the ASINs in MP Spain which belong to GL Camera and have an Amazon offer.
But what if you have to get some information for 20,000 ASINs at once? You basically have 2 options:
a. Insert them manually in the SQL statement (this is called 'hard coding'). This may work for 4,000 or maybe even 20,000 ASINs (depending on the complexity and the efficiency of your query) but it will not scale to 100k ASINs.
b. Load the ASINs into DW and reference them in your query through a segment.
The way you do this is by loading a segment in DW.
Basically, your segment is now loading on every data cluster. Once the segment has finished loading on a given cluster you will no longer see it in that view, but you will be able to see it in My Segments.
How do I reference my Segment in a query?
By using a subquery, of course. It’s not the only option but it’s the easiest. All the segments you create
are stored in a DW table which you can then query. For ASINs that table will be
PRODUCT_SEGMENT_MEMBERSHIP.
First off, you can load more than just lists of ASINs in a segment. Depending on what you need, you can load EANs, UPCs, vendor codes, customer IDs, postal codes, merchant IDs and so on. If you have a list of elements which you know is constant (the same list of vendor codes, the same list of ASINs), using a segment and referencing it this way is a lot more convenient than hard-coding the values into your query, which will not only increase the strain on the query, slowing it down, but will also make it hard for you or any other user to read and understand your code, given that you will have to sort through countless rows of endless ASINs.
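As a sketch, referencing a segment through a subquery could look like this; the segment id 12345 is purely hypothetical, and the column names in PRODUCT_SEGMENT_MEMBERSHIP are assumed:

```sql
SELECT
     asin
    ,marketplace_id
    ,item_name
FROM D_MP_ASINS
WHERE region_id = 2
  -- the subquery returns the ASINs loaded in the segment, replacing a
  -- hard-coded list of thousands of values
  AND asin in (SELECT asin
               FROM PRODUCT_SEGMENT_MEMBERSHIP
               WHERE segment_id = 12345)
```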
3. WILDCARDS
We’ll now get to see some of that extra ETL functionality I talked about in the previous courses. By using wildcards you can reference certain information in your query without hard-coding it. Think of the wildcard functionality as being similar to that of a segment.
The date you select here is the same date which is referenced in the query above. Imagine you have a daily report for which you need data. Normally, you’d need to change your code every day and update it with the latest dates. The run date wildcard makes this task unnecessary, as every day your query runs it already looks to the run date for information on the time frame.
Basically, all the tricks we learned for DATE manipulation can also be used with the run date wildcard. Keep in mind that, just like the rest of the wildcards, run date is an ETL function, so it is not supported by SQL Developer.
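To illustrate, the date tricks from the earlier chapters combine with the wildcard like this (a sketch; the table follows the examples in this course):

```sql
-- pull the last 7 days' glance views relative to the job's run date;
-- no manual date updates are needed between runs
SELECT
     asin
    ,sum(nvl(GLANCE_VIEW_COUNT,0)) as GVs
FROM D_DAILY_ASIN_GV_METRICS
WHERE SNAPSHOT_DAY between TO_DATE('{RUN_DATE_YYYY/MM/DD}','YYYY/MM/DD') - 7
                       and TO_DATE('{RUN_DATE_YYYY/MM/DD}','YYYY/MM/DD')
GROUP BY asin
```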
Unfortunately, you can’t use all 3 partition types at once, so you have to choose between them. This is not actually that big of an issue, as MARKETPLACE_ID and LEGAL_ENTITY_ID mean pretty much the same thing; it’s just a matter of which of the columns your table contains. As for REGION_ID, as long as you have MARKETPLACE_ID you can use the EU values to mimic partitioning by REGION_ID.
This is where you can view the partition information you chose. Look at the example on the left and notice that you are not limited to just one value. In this case I chose the UK, DE and FR marketplaces.
Be very careful about what type of data you use in your free form tag. ETL reads the free form tag just as if the information you’re referencing was hard coded into the query, meaning that if you had CHAR or VARCHAR2 information you’d need to enclose those bits of information in single quotes. A few examples are vendor codes, brand codes, ASINs etc.
One more thing worth mentioning is that using the free form and
writing it in the query in upper case (e.g. ASIN in ({FREE_FORM}))
means that whatever info you have in your free form tag will also be
converted to upper case.
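A quick sketch of the free form tag in use; the values for the tag (e.g. 'B000EXAMPLE','B001EXAMPLE') would be entered on the ETL job page, with single quotes included since ASIN is VARCHAR2 data:

```sql
SELECT asin, marketplace_id, item_name
FROM D_MP_ASINS
WHERE region_id = 2
  -- ETL substitutes the tag's contents verbatim before running the query
  AND ASIN in ({FREE_FORM})
```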
4. PRACTICE
1. Load the following ASINs into a DW segment.
2. Build a query which, for the segment you have created, extracts the glance views, the
marketplace, the GL, the item name and the brand name – use as many tables as you have to.
3. Using the {RUN_DATE} wildcard, make your query run for the last calendar month.
4. Use the SEGMENT_ID in a subquery.
5. Reference the SEGMENT_ID using the FREE_FORM tag.
6. Limit your results to marketplaces 3, 4 and 5. Do this by using the {MARKETPLACE_ID} wildcard.
7. Extract the above information only for ASINs having over 10 glance views for the given time
frame.
8. Rank the ASINs by glance views so you get the top ASINs at the top of your output.
VI Publishers, Decode, Case when, Lower, Upper, Partition by, Coalesce
We’ve covered basic query writing so far. You’ve learned how to search for data, how to identify the DW
tables you need, how to query for the data you want, how to use information from multiple DW sources,
how to aggregate information, how to use subqueries, how to use the ETL functionalities like segments,
wildcards, job scheduling, notifications on query error or success. You have all the basic knowledge to
write simple to advanced ETL queries, but there are still a lot of things which can make your life a lot
easier when writing code.
1. PUBLISHERS
The publisher of a query is another functionality ETL offers. Setting up a publisher allows DW to send you the results of a query via email or publish them to a sharedrive. Why is publishing the results of a query a big deal? First off, it saves a lot of time you’d otherwise lose by manually downloading the results (having them sent directly to you via email is very convenient). It becomes especially useful when the results of your query are really big (exceeding 30 MB) and manually downloading them can take a while. Plus, having the results automatically publish to a location is the first step towards automating processes/reports.
Go to DW at the query page and look to the Publisher section. Click New and this will open a drop-down menu asking for the type of publisher you want.
Just choose email from the drop-down above, which will automatically bring you to the window on the left. Here you can write a description, or not, it’s not mandatory. Leave the encoding as is, but set the format to Text. Input the email address in the email addresses box; additionally, you can also choose to have the results sent as a Zip attachment, which is a good idea, else you will get the entire output in an email.
This is a bit trickier, as it requires more than just writing an email address. The Description is the same as above, you should still choose Text as the format and leave the Encoding unchanged, but you’ll have to input the address of the file. Be very careful when doing this, as adding an extra space, even at the end of the address, will result in DW not recognizing the path and giving you an error when publishing.
Also, DW will create the file you want it to, but it will not create the path (the folders leading to that file). You also need to be very careful and remember to set an extension for the file you’re setting for the publisher. E.g. \\ant\dept-eu\IAS2\RBS-RO-IAS2-NIS\EFN Flex Tasks\Daily EFN Uploads\EFN Removal\New.txt.
I’ve done all the steps above and still I get an error when publishing, why is that?
There is one more thing you need to be sure of: that DW has the permissions to write files at the address you chose. In the folder where you want to publish the file, right click and go to Properties → Security → Advanced. In the window which pops up you’ll see all users authorized to publish content.
If you see the user DW-Metron or D. W. Metron, that’s good news; it means Data Warehouse has rights to publish to your location. This will not always be the case, so you may have to add the user yourself (if you have permissions to do that) or contact your local IT department to add the user for you. In case you can do it yourself, here’s where you need to go.
Click Change Permissions → Add… → Advanced, and in the name box search for the user DW-Metron → OK, then give DW-Metron the following permissions:
In case you are not authorized to add DW-Metron yourself please reach out to your local IT department
for guidance.
2. DECODE
The DECODE function allows you to recode the output data of a field into whatever you desire. By using DECODE, you are basically changing some (or all) of the values which can be found in one column. An example of that is recoding the MARKETPLACE_ID codes into easier-to-understand tags.
But what about the second decode, aliased IS_UK? It has an odd number of arguments, how can it work?
Well, that decode is working a bit differently because it tells SQL to convert all the ‘7’ values it finds into
the value ‘YES’, and convert everything else to the value ‘NO’.
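Since the screenshots are not reproduced here, a hedged reconstruction of the two DECODEs discussed above could look like the following; the marketplace-to-country mapping shown is illustrative, not authoritative:

```sql
SELECT
     asin
    -- pairs of (search value, result), plus a final default that keeps
    -- the original value for anything unmatched
    ,decode(marketplace_id, 3,'UK', 4,'DE', 5,'FR', marketplace_id) as MP
    -- an odd count of search/result arguments: everything that is not 7
    -- falls through to the single default value 'NO'
    ,decode(marketplace_id, 7,'YES', 'NO') as IS_UK
FROM D_MP_ASINS
```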
DECODE is a very useful function and it does a bit more than I have shown you here. Sure, it’s good Cx to have appropriate data tags in your reports, and DECODE does in SQL what FIND AND REPLACE does in Excel, without the extra work, but in the Tips and Tricks section you’ll see another use for this function.
3. CASE WHEN
Think of CASE WHEN as the big brother of DECODE. It is the SQL equivalent of the IF function in Excel, with the added value that it is a lot easier to write than IF and it scales a lot better once it gets more complex.
Let’s look at the CASE WHEN syntax before going into the examples. First off, notice that you can have
one or multiple conditions in a CASE WHEN statement, you can reference one or more columns, and,
based on the needs of your query, you can generate one or more output values.
The second CASE WHEN is already a bit more complex. It gives a binary output (‘FR inv in DE’ or BLANK), just as the line of code above it did, but it looks at information from multiple columns. Whenever the 3 conditions I placed in this CASE WHEN are met, the query returns the value I want.
The third CASE WHEN is the coolest, because it looks, again, at more than one column, and it gives me more comprehensive feedback on what it finds. What I did was basically decode the output of the conditions I entered.
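A hedged sketch of the CASE WHEN shapes discussed above; the table, column names and values are illustrative stand-ins, not the exact ones from the screenshots:

```sql
SELECT
     asin
    -- binary output based on multiple columns: value or BLANK (NULL)
    ,case when marketplace_id = 4
           and on_hand_quantity > 0
           and source_marketplace = 'FR'
          then 'FR inv in DE' end as FLAG1
    -- multiple conditions, multiple output values
    ,case when glance_views = 0 and on_hand_quantity > 0 then 'Inventory, no traffic'
          when glance_views > 0 and on_hand_quantity = 0 then 'Traffic, no inventory'
          else 'OK' end as FLAG2
FROM MY_RESULTS
```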
CASE WHEN is a complex function and can grow quite big depending on what information you want to base its output on. It will eat up more of the system resources and it will make your query heavier, so keep that in mind. A heavy CASE WHEN could be the tipping point of an already complex and resource-demanding query.
4. LOWER AND UPPER
When you limit the results of a query to just one value a column can have (e.g. brand_name = ‘nike’), if the value you want to match isn’t written in exactly the same way in the table, the filter will not work and the query will return 0 rows.
It’s even easier to use LOWER and UPPER along with pattern matching characters. In the output above I see that the brand name is Bose, but from experience, multiple vendors/sellers can submit a different brand name for what is essentially the same brand. My brand could have been named in DW as ‘Bose electronics’ or ‘Bose Audio’ etcetera, so even if I had gotten the lower/upper case right, I still would have found nothing or missed some records. But if I phrase my filter as ‘and lower(brand_name) like ‘bose%’ ’ then I increase the odds of finding everything I want.
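Sketched out as a full statement (table and columns follow the earlier examples):

```sql
SELECT asin, brand_name
FROM D_MP_ASINS
WHERE region_id = 2
  -- lower() normalizes the case; the trailing % matches 'Bose',
  -- 'Bose electronics', 'BOSE Audio' and so on
  AND lower(brand_name) like 'bose%'
```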
5. PARTITION BY
PARTITION BY is a function which allows you to get aggregated results without being obligated to use GROUP BY. For example, I can write the following statement in 2 ways and they both work the same:
There is no difference between the 2 versions so why use PARTITION BY at all? PARTITION BY is an
example of SQL functionality which you can use to further process your results. You basically trick SQL
into doing something that it won’t do using normal functions (group by).
Basically, I have grouped the GVs by only 2 parameters instead of everything I had in my SELECT statement, and I also have a numbered count (aliased ‘count1’). I’ll share some examples of how PARTITION BY helps you in the tips and tricks portion of this course.
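The two equivalent forms can be sketched like this (column names are assumptions based on the tables used earlier):

```sql
-- GROUP BY version: collapses to one row per asin + marketplace
SELECT asin, marketplace_id, sum(nvl(GLANCE_VIEW_COUNT,0)) as GVs
FROM D_DAILY_ASIN_GV_METRICS
GROUP BY asin, marketplace_id

-- PARTITION BY version: every row is kept, the aggregate is computed
-- over the asin + marketplace partition, and count1 counts the rows
-- in each partition
SELECT
     asin
    ,marketplace_id
    ,snapshot_day
    ,sum(nvl(GLANCE_VIEW_COUNT,0)) over (partition by asin, marketplace_id) as GVs
    ,count(*) over (partition by asin, marketplace_id) as count1
FROM D_DAILY_ASIN_GV_METRICS
```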
One case where it helps you is with the MEDIAN aggregate function. First off, what is the median? It’s the number separating the higher half of a data sample, a population, or a probability distribution from the lower half. When is it best to use the median? When you have an irregular distribution of values which makes the average less relevant. For example, if you had a list of 12 vendor costs for the same ASIN, where 11 would be around the value of 7$ while the 12th would be equal to 50$, calculating an average sourcing price would not be an accurate depiction of reality, as Amazon would not choose the most costly option when it has 11 cheaper ones. If you presented a report showing the average sourcing cost, you’d see that you can get that product for an average of 10.58$, when in reality Amazon got it for around 7$. In this case, the median protects your report from being skewed by bad data.
Where in the case above, the average cost is 10.58$, the median cost is 7$, which is also a lot closer to
reality.
You can see the differences more easily here. The query above pulls the average GVs at EU level, the median at EU level and the median at MP level. Check the results below. I have brought it all into one view without using extra Excel functions. What if instead of glance views you had sales, or GMS, or CP?
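A sketch of how such a query could be written, using MEDIAN as an analytic function with PARTITION BY (table and columns follow the earlier examples):

```sql
SELECT
     asin
    ,marketplace_id
    -- empty OVER () = the whole EU result set is one partition
    ,avg(GLANCE_VIEW_COUNT)    over ()                            as avg_gv_eu
    ,median(GLANCE_VIEW_COUNT) over ()                            as median_gv_eu
    -- per-marketplace median, side by side in the same output
    ,median(GLANCE_VIEW_COUNT) over (partition by marketplace_id) as median_gv_mp
FROM D_DAILY_ASIN_GV_METRICS
WHERE region_id = 2
```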
6. COALESCE
This SQL function fuses 2 or more column outputs into one single output. That sounds like what concatenating does, so why do we need another function? While concatenating brings together the information of several columns plus whatever other elements you desire, COALESCE does something else. It takes a number of columns (2 or more) and returns the value of the first column which is not blank (NULL).
Let me illustrate this with an example so it becomes clearer. The table D_MP_ASINS has 3 columns which basically hold the same information, the brand name. Those fields are: publisher studio label, brand name and merchant brand name. None of these fields is null proof, and it’s generally good information to know what the brand of the product you’re querying is.
The results look like this. Notice that the field ‘brand_name’ is blank. Had I relied solely on it I would have
missed some important information.
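A minimal sketch of that COALESCE, assuming the three fields are named brand_name, merchant_brand_name and publisher_studio_label in D_MP_ASINS:

```sql
SELECT
     asin
    ,brand_name
    ,merchant_brand_name
    ,publisher_studio_label
    -- returns the first non-NULL value of the three, left to right
    ,coalesce(brand_name, merchant_brand_name, publisher_studio_label) as best_brand
FROM D_MP_ASINS
WHERE region_id = 2
```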
VII Tips and tricks
Working with ETL for a while will get you to some realizations about what works, what doesn’t, how to build a query to scale better, etcetera. You can learn a lot from reading other people’s queries, deconstructing them, seeing how each table is partitioned and how it is joined; generally it’s a good exercise to expand your knowledge. Below you will find some easy-to-use tricks I found while trying to add more functionality to my queries or make them scale.
You’ll need to watch out for a few things and go through several steps. The steps below are my process; it’s what I do and what I found works for me and makes my work more efficient. However, what works for some doesn’t necessarily work for others, so take the steps below with a pinch of salt.
What I do is ask myself the following questions, in the sequence presented below:
e. What sources of error can I have with my output?
- Logical sources – conditions in my query which counter each other (e.g. region id = 2 (EU) and
marketplace id = 1 (US) )
- Technical – erroneous joins in which I lose information or which create duplicate rows, tables
which do not contain the info you expect etc.
f. How do I test the accuracy of my output?
Presenting inaccurate, irrelevant or just plain bad numbers will throw off your entire argumentation and business case. I had to find this out the hard way, and I was really lucky to be given a second chance to pull the data again.
There are several things you can do in this case to ensure maximum data accuracy:
- Analyze output versus expectations – if you’re creating a new report for your team’s productivity
for example, you can easily spot missing users from that report
If no expectations are available:
- Try to refer to other existing tools which do not rely on DW. For example you can rely on CSI to
test certain attributes, you can try Alaska if your query pulls inventory levels, you can use BIW to
test glance view information etc.
- Write your code in different ways, relying on different tables. As stated in the chapters above, each DW table has its own specifics, its own way of aggregating the data, its own level of granularity and, in the end, even different data. Even though it would be safe to assume you’d find the same information everywhere, that’s rarely the case, and most of the time not because of an error. If you build your query in 2 or 3 different ways and the results point in the same direction, you’re safe to use that data.
- Ultimately, test out every piece of your query: break it into smaller chunks and analyze the results of every join, the data a subquery returns and how (and if) it changes when brought into the main select.
2. WITH CLAUSE
The advantages of using subqueries over normal joins have been presented in chapter V, so I’ll skip that part. I showed you how to add a subquery in the FROM clause or even in the WHERE clause (as ASIN in (subquery)), but there is a far more efficient way of using subqueries, which is declaring them at the beginning of your code using WITH.
The syntax is not very complicated: at the beginning of the query you start with ‘WITH [subselect alias] as’, followed by the subselect. When you want to add another subselect, define it just like the first one, only replace ‘WITH’ with a comma ‘,’. This is just like listing a set of columns in the SELECT, only you’re listing a set of temp tables you are creating.
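Sketched with the tables used earlier in this course (the segment id is hypothetical), the WITH syntax looks like this:

```sql
-- first subselect: the list of ASINs to run the query for
WITH asn as (SELECT asin
             FROM PRODUCT_SEGMENT_MEMBERSHIP
             WHERE segment_id = 12345)   -- hypothetical segment id
-- second subselect: same pattern, WITH replaced by a comma
,    gv  as (SELECT asin, marketplace_id, sum(nvl(GLANCE_VIEW_COUNT,0)) as GVs
             FROM D_DAILY_ASIN_GV_METRICS
             WHERE region_id = 2
             GROUP BY asin, marketplace_id)
-- main SELECT: short and easy to read, joining the temp tables
SELECT gv.asin, gv.marketplace_id, gv.GVs
FROM asn
JOIN gv on gv.asin = asn.asin
```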
So why do it this way and not the classic way? There are several reasons that come to mind.
a. The query is a lot easier to read because it is well organized. Instead of staring at a big number of rows in the FROM clause, trying to break apart the different tables and subselects used, how they’re built and how they are joined, everything is well delimited: you have the tables you are querying at the top, and the main SELECT is easy to read.
b. For a reason which has never been explained to me, but which I have seen happen with queries I had no hope of ever running, this way of writing the code makes your query very efficient. I have had queries which would not run at all, always erroring out due to excessive running time, which, when reorganized using ‘WITH’, ran successfully in under 30 minutes.
c. It is easy to break apart and test. If you are not sure of how the results look, this syntax allows
you to spot the error easily. You can test the output of one subselect at a time and search for
duplicate rows, output not as you would have expected (e.g. blank columns). It is much easier to
find the root cause of your error.
d. This makes building queries on the go quite effortless. For example, in the query above, the ‘asn’ subselect, whose only purpose is to provide me a list of ASINs, can be adapted to any query. I can almost always rely on that piece of code as the part which gives me the selection I want to run the query for. Since it is already tested, you don’t have to test it again when copying it to a new query.
Here are some other examples of subselects I often use which allow me to build a query swiftly:
Subselect for glance views
gvs as (SELECT -- NOTE: the alias and this opening line are assumed; the original was cut off at the page break
MARKETPLACE_ID
,asin
,sum(nvl(GLANCE_VIEW_COUNT,0)) as GVs
FROM D_DAILY_ASIN_GV_METRICS
WHERE REGION_ID =2
AND MARKETPLACE_ID in (3,4,5,35691,44551)
AND SNAPSHOT_DAY between TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD') - 7
and TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')
AND IS_SUPPRESSED_ASIN in ('N')
GROUP BY
MARKETPLACE_ID
,asin )
Subselect for OPS (ordered product sales), units sold and GMS (Gross merchandise sales)
dem as (SELECT
asin
,marketplace_id
,sum(nvl(NET_ORDERED_UNITS,0)) as demand --units sold over x period of time
,sum(nvl(NET_ORDERED_PRODUCT_SALES,0)) as NET_OPS --net OPS over x period of time
,sum(nvl(TOTAL_GMS,0)) as TOTAL_GMS -- GMS over x period of time
FROM
D_DAILY_ASIN_ACTIVITY
WHERE
REGION_ID=2
AND MARKETPLACE_ID in (3,4,5,35691,44551)
AND MERCHANT_CUSTOMER_ID in (9,10,11,755690533,695831032) -- Amazon EU Retail
AND ACTIVITY_DAY BETWEEN TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD')-365
AND TO_DATE('{RUN_DATE_YYYY/MM/DD}', 'YYYY/MM/DD') --x period of time
group by
asin ,marketplace_id )
,nvl(sum(nvl(case when inv.INVENTORY_OWNER_GROUP_ID = 85 then inv.ON_HAND_QUANTITY
end,0)),0) as instock_ES
FROM
D_INVENTORY_LEVEL_BY_OWNER inv
join o_warehouses wh on wh.warehouse_id = inv.warehouse_id
where
inv.INVENTORY_OWNER_GROUP_ID in (7,8,9,75,85)
and inv.region_id = 2 and inv.snapshot_day =
TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')
and inv.INVENTORY_OWNER_GROUP_ID =
decode(wh.organizational_unit_id,2,7,3,8,8,9,29,75,30,85) -- joining IOG on ORGANIZATIONAL_UNIT_ID to only pull inventory stowed in the owning MP
group by
inv.asin)
Let’s look at this example and then break it off into pieces explaining what is happening.
Try not to focus on that new syntax in the SELECT. That’s a function called SUBSTR (it does exactly what the =LEFT or =RIGHT formulas in Excel do: it gets the number of characters you specify, starting from a point you specify).
Looking at the description in the sample data, I see that the TYPE column values always start with the country prefix (e.g. UK, CA, JP, DE etc.), so I can use a trick to get the marketplace information from there (the SUBSTR function). However, I have no way of joining D_VENDOR_COSTS and O_AMAZON_BUSINESS_GROUPS directly, as there are no common columns.
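The SUBSTR trick described above can be sketched like this (the column layout of D_VENDOR_COSTS is assumed from the description):

```sql
SELECT
     TYPE
    -- take 2 characters starting at position 1, i.e. the country prefix;
    -- the Excel equivalent would be =LEFT(TYPE, 2)
    ,substr(TYPE, 1, 2) as country_prefix
FROM D_VENDOR_COSTS
```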
Which is where the table VENDORS comes in. I use this table as a bridge between D_VENDOR_COSTS and O_AMAZON_BUSINESS_GROUPS.
Essentially, this is a way of bridging two tables which otherwise could not be joined.
4. USING SUBQUERIES FOR MULTIPLE CALCULATIONS
Let’s look a bit at this query, what it does and why. Again, don’t focus on the functions or statements we have not covered yet, they’re less important. I needed this query to pull 30% of all YUMA cases each associate handles, per closing reason. So if an associate works 10 cases/reason a, 10 cases/reason b and 10 cases/reason c, I would need the query to pull 3 case_ids for reason a, 3 for reason b and 3 for reason c. At the same time, if the associate in question worked fewer than 3 cases/reason, I would need all those cases in my extract. This requires an algorithm which is hard/impractical to implement in a single plain query.
So I have a big select containing all the info I want from this query, including the rownumber for all
combinations of user_id + reason. This means that the results from my first layer look like this:
case_id | user_id | reason | row_count
Bbb22   | Id_2    | 1      | 2
Bbb23   | Id_2    | 1      | 3
The second layer extracts everything from the 1st layer and gives me the max row_count per
combination of user + reason. The table would look like this.
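The layered approach described above can be sketched as follows; the source table YUMA_CASES and its column names are hypothetical stand-ins:

```sql
-- layer 1: number each case within its user + reason combination
WITH layer1 as (SELECT
                     case_id
                    ,user_id
                    ,reason
                    ,row_number() over (partition by user_id, reason
                                        order by case_id) as row_count
                FROM YUMA_CASES)
-- layer 2: carry everything forward and add the max row_count per group,
-- i.e. the total number of cases for that user + reason
,    layer2 as (SELECT l.*
                      ,max(row_count) over (partition by user_id, reason) as max_rows
                FROM layer1 l)
-- final select: keep ~30% of each group, but never fewer than 3 cases
-- (which also keeps all cases when the associate worked fewer than 3)
SELECT case_id, user_id, reason
FROM layer2
WHERE row_count <= greatest(3, ceil(max_rows * 0.3))
```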