Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

Exclusive SQL Tutorial on Data Analysis

in R
Introduction
Many people are pursuing data science as a career (to become a data scientist) choice these days.
With the recent data deluge, companies are voraciously headhunting people who can handle,
understand, analyze, and model data.

Be it college graduates or experienced professionals, everyone is busy searching for the best courses
or training material to become a data scientist. Some of them even manage to learn Python or R, but
still can’t land their first analytics job!

What most people fail to understand is that the data science/analytics industry isn’t just limited to
using Python or R. There are several other coding languages which companies use to run their
businesses.

Among all, the most important and widely used language is SQL (Structured Query Language). You
must learn it.

I’ve realized that, as a newbie, learning SQL is somewhat difficult at home. After all, setting up a server
enabled database engine isn’t everybody’s cup of tea. Isn’t it? Don’t you worry.

In this article, we’ll learn all about SQL and how to write its queries.

Note: This article is meant to help R users who wants to learn SQL from scratch. Even if you are new
to R, you can still check out this tutorial as the ultimate motive is to learn SQL here.

Table of Contents
1. Why learn SQL ?
2. What is SQL?
3. Getting Started with SQL
o Data Selection
o Data Manipulation
o Strings & Dates
4. Practising SQL in R

Why learn SQL ?


Good question! When I started learning SQL, I asked this question too. Though, I had no one to
answer me. So, I decided to find it out myself.

SQL is the de facto standard programming language used to handle relational databases.
Let’s look at the dominance / popularity of SQL in worldwide analytics / data science
industry. According to an online survey conducted by Oreilly Media in 2016, it was found that among
all the programming languages, SQL was used by 70% of the respondents followed by R and Python.
It was also discovered that people who know Excel (Spreadsheet) tend to get significant salary boost
once they learn SQL.

Also, according to a survey done by datasciencecentral, it was inferred that R users tend to get a nice
salary boost once they learn SQL. In a way, SQL as a language is meant to complement your current
set of skills.

Since 1970, SQL has remained an integral part of popular databases such as Oracle, IBM DB2,
Microsoft SQL Server, MySQL, etc. Not only learning SQL with R will increase your employability, but
SQL itself can make way for you in database management roles.

What is SQL ?
SQL (Structured Query Language) is a special purpose programming language used to manage,
extract, and aggregate data stored in large relational database management systems.

In simple words, think of a large machine (rectangular shape) consisting of many, many boxes (again
rectangles). Each box comprises a table (dataset). This is a database. A database is an organized
collection of data. Now, this database understands only one language, i.e, SQL. No English,
Japanese, or Spanish. Just SQL. Therefore, SQL is a language which interacts with the databases to
retrieve data.

Following are some important features of SQL:

1. It allows us to create, update, retrieve, and delete data from the database.
2. It works with popular database programs such as Oracle, DB2, SQL Server, etc.
3. As the databases store humongous amounts of data, SQL is widely known for it speed and
efficiency.
4. It is very simple and easy to learn.
5. It is enabled with inbuilt string and date functions to execute data-time conversions.

Currently, businesses worldwide use both open source and proprietary relational database
management systems (RDBMS) built around SQL.

Getting Started with SQL


Let’s try to understand SQL commands now. Most of these commands are extremely easy to pick
up as they are simple “English words.” But make sure you get a proper understanding of their
meanings and usage in SQL context. For your ease of understanding, I’ve categorized the SQL
commands in three sections:

1. Data Selection – These are SQL’s indigenous commands used to retrieve tables from
databases supported by logical statements.
2. Data Manipulation – These commands would allow you to join and generate insights from
data.
3. Strings and Dates – These special commands would allow you to work diligently with dates
and string variables.

Before we start, you must know that SQL functions recognize majorly four data types. These are:

1. Integers – This datatype is assigned to variables storing whole numbers, no decimals. For
example, 123,324,90,10,1, etc.
2. Boolean – This datatype is assigned to variables storing TRUE or FALSE data.
3. Numeric – This datatype is assigned to variables storing decimal numbers. Internally, it is
stored as a double precision. It can store up to 15 -17 significant digits.
4. Date/Time – This datatype is assigned to variables storing data-time information. Internally, it
is stored as a time stamp.

That’s all ! If SQL finds a variable whose type is anything other than these four, it will throw read errors.
For example, if a variable has numbers with a comma (like 432,), you’ll get errors. SQL as a language
is very particular about the sequence of commands given. If the sequence is not followed, it starts to
throw errors. Don’t worry I’ve defined the sequence below. Let’s learn the commands. In the following
section, we’ll learn to use them with a data set.

Data Selection
1. SELECT – It tells you which columns to select.
2. FROM – It tells you columns to be selected should be from which table (dataset)
3. LIMIT – By default, a command is executed on all rows in a table. This commands limits the
number of rows. Limiting the rows leads to faster execution of commands.
4. WHERE – This commands specifies a filter condition; i.e., the data retrieval has to be done
based on some variable filtering.
5. Comparison Operators – Everyone knows these operators as ( = , != , < , > , <= , >= ). They
are used in conjunction with the WHERE command.
6. Logical Operators – The famous logical operators (AND, OR, NOT ) are also being used to
specify multiple filtering conditions. Other operators are:
o LIKE – It is used to extract similar values and not exact values.
o IN – It is used to specify the list of values to extract or leave out from a variable.
o BETWEEN – As the names suggests, it activates a condition based on variable(s) in
the table.
o IS NULL – It allows you to extract data without missing values from the
specified column.
7. ORDER BY – It is used to order a variable in descending or ascending order.

Data Manipulation
1. Aggregate Functions – Originating from statistics, these functions are insanely helpful in
generating quick insights from data sets.
o COUNT – It counts the number of observations.
o SUM – It calculates the sum of observations.
o MIN/MAX – It calculates the min/max and, eventually, the range of a numerical
distribution.
o AVG – It calculates the average aka mean.
2. GROUP BY – For categorical variables, it calculates the above stats based on their unique
levels.
3. HAVING – It is mostly used for strings to specify a particular string or combination while
retreiving data.
4. DISTINCT – It returns the unique number of observations.
5. CASE – It is used to create rules using if/else conditions.
6. JOINS – It is a popular function in SQL as individual tables are often required to merged to
create more meaningful data. It can implement the following variations of join. To understand
these joins, let’s say we have two tables: Table A and Table B. Both tables have
seven variables. Two variables of Table A are also available in Table B (Shown below as an
image). Based on a specified criteria:
o INNER JOIN – It returns the common rows specifying the joining criteria from A and
B.
o OUTER JOIN – It returns the rows which are not common to A and B.
o LEFT JOIN – It returns the rows which are in A but not in B.
o RIGHT JOIN – It returns the rows which are in B but not in A.
o FULL OUTER JOIN – It returns all the rows from both tables. It often leads to NULL
values in the resultant data set.
7. ON – It is used to specify a column used for filtering while joining tables.
8. UNION – It is similar to rbind() in R. Use it to combine two tables where both the tables have
identical variable names.

In addition, you can also write complex join commands by using comparison operators or WHERE or
ON command to specify a condition.

Strings and Dates


1. NOW – It return current time.
2. LEFT – It return the number of characters (specified by user) from the left in a string.
3. RIGHT – It return the number of characters (specified by user) from the right in a string.
4. LENGTH – It returns the length of the string.
5. TRIM – It is used to remove characters from the beginning and end of the string.
6. SUBSTR – It is used to extract part of string with specified start and end positions; it’s
similar to the substr() function in R.
7. CONCAT – It is used to combine strings. concat means concatenation.
8. UPPER – It converts a string to the upper case.
9. LOWER – It converts a string to the lower case.
10. EXTRACT – It is used to extract date components such as date, month, day, year, and
minute from a date variable.
11. DATE_TRUNC – It is used to round dates to its nearest unit of measurement. It is mostly used
when the data is captured in smallest unit of measurements such as milliseconds, time zone
specific, etc.
12. COALESCE – It is used to impute missing values.

These commands are not case sensitive; i.e., you need not write them in capitals always. But make
sure consistency is maintained. SQL commands follow this standard sequence:

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
7. LIMIT

You might also like