Data Science Details (AutoRecovered)


Here is a list of Essential Skills for #DataScience:

1. Getting Started
What is Data Science? https://lnkd.in/gDVUh2e

2. SQL Skills {Intermediate - Advanced}
Intro to SQL: https://lnkd.in/giWs-3N
Complete SQL Bootcamp: https://lnkd.in/gEUm4Ab

3. Data Manipulation and Visualization
a. Python
Pandas: https://lnkd.in/g4DFNpJ
Matplotlib/Seaborn: https://lnkd.in/g_3fx_6
Recommended Course: https://lnkd.in/gZzeHht
b. R
Dplyr: https://lnkd.in/erB_FVV
Ggplot2: https://lnkd.in/eThJXNr
Recommended Course: https://lnkd.in/geiUx-x
c. Tableau
Course: https://lnkd.in/gEC56jh

4. Core Statistical Inference
Article: https://lnkd.in/evqJ6We
Course: https://lnkd.in/gXWQZJD

5. Machine Learning
Article: https://lnkd.in/gTK2SzC
Course: https://lnkd.in/g6xTEGq

6. Version Control (Git/Github)
Article: https://lnkd.in/esTjNma

7. Basics of Command Line
Article: https://lnkd.in/e3EQuis

8. Communication & Data Storytelling
Article: https://lnkd.in/eqf5gUV

9. Business Understanding
Course: https://lnkd.in/gpF-u-T

Having a good understanding of these essential skills will definitely help you in getting
into data science. Take the time to learn, create projects, and share your work with the
community. For more free resources you can visit my site: www.claoudml.com

Filtering results
Congrats on finishing the first chapter! You now know how to select columns and
perform basic counts. This chapter will focus on filtering your results.

In SQL, the WHERE keyword allows you to filter based on both text and numeric values in
a table. There are a few different comparison operators you can use:

• = equal
• <> not equal
• < less than
• > greater than
• <= less than or equal to
• >= greater than or equal to

For example, you can filter text records such as title. The following code returns all
films with the title 'Metropolis':

SELECT title
FROM films
WHERE title = 'Metropolis';

Notice that the WHERE clause always comes after the FROM clause!

Note that in this course we will use <> and not != for the not equal operator, as
per the SQL standard.

What does the following query return?

SELECT title
FROM films
WHERE release_year > 2000;
WHERE AND OR
What if you want to select rows based on multiple conditions where some but not all of
the conditions need to be met? For this, SQL has the OR operator.

For example, the following returns all films released in either 1994 or 2000:

SELECT title
FROM films
WHERE release_year = 1994
OR release_year = 2000;
Note that you need to specify the column for every OR condition, so the following is
invalid:
SELECT title
FROM films
WHERE release_year = 1994 OR 2000;
When combining AND and OR, be sure to enclose the individual clauses in parentheses,
like so:
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');

Otherwise, due to SQL's precedence rules, you may not get the results you're
expecting!
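
To see the precedence rule in action, here is a minimal sketch that runs both versions of the query. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for the course database, and the four films are made up for illustration.

```python
import sqlite3

# In-memory database with a tiny, made-up films table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (title TEXT, release_year INTEGER, certification TEXT)"
)
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 1994, "PG"), ("B", 1994, "G"), ("C", 1995, "R"), ("D", 1995, "G")],
)

# Parentheses: (1994 or 1995) AND (PG or R).
with_parens = conn.execute(
    """
    SELECT title FROM films
    WHERE (release_year = 1994 OR release_year = 1995)
      AND (certification = 'PG' OR certification = 'R')
    ORDER BY title
    """
).fetchall()

# No parentheses: AND binds tighter than OR, so this is read as
# year = 1994 OR (year = 1995 AND cert = 'PG') OR cert = 'R'.
without_parens = conn.execute(
    """
    SELECT title FROM films
    WHERE release_year = 1994 OR release_year = 1995
      AND certification = 'PG' OR certification = 'R'
    ORDER BY title
    """
).fetchall()

print(with_parens)     # [('A',), ('C',)]
print(without_parens)  # [('A',), ('B',), ('C',)]
```

Film B (1994, rated G) slips into the second result because the bare release_year = 1994 condition is enough on its own once the parentheses are gone.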

BETWEEN
As you've learned, you can use the following query to get titles of all films released
between 1994 and 2000, inclusive:

SELECT title
FROM films
WHERE release_year >= 1994
AND release_year <= 2000;
Checking for ranges like this is very common, so in SQL the BETWEEN keyword provides
a useful shorthand for filtering values within a specified range. This query is equivalent
to the one above:
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000;
It's important to remember that BETWEEN is inclusive, meaning the beginning and end
values are included in the results!
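
The inclusive behaviour can be checked with a small sketch. It assumes SQLite (via Python's sqlite3 module) in place of the course database, and the film rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 1993), ("B", 1994), ("C", 1997), ("D", 2000), ("E", 2001)],
)

# BETWEEN includes both endpoints (1994 and 2000).
between_rows = conn.execute(
    "SELECT title FROM films "
    "WHERE release_year BETWEEN 1994 AND 2000 ORDER BY title"
).fetchall()

# The explicit comparison form from the text.
explicit_rows = conn.execute(
    "SELECT title FROM films "
    "WHERE release_year >= 1994 AND release_year <= 2000 ORDER BY title"
).fetchall()

print(between_rows)  # [('B',), ('C',), ('D',)]
```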

WHERE IN
As you've seen, WHERE is very useful for filtering results. However, if you want to filter
based on many conditions, WHERE can get unwieldy. For example:
SELECT name
FROM kids
WHERE age = 2
OR age = 4
OR age = 6
OR age = 8
OR age = 10;
Enter the IN operator! The IN operator allows you to specify multiple values in
a WHERE clause, making it easier and quicker to specify multiple OR conditions! Neat,
right?

So, the above example would become simply:

SELECT name
FROM kids
WHERE age IN (2, 4, 6, 8, 10);
Try using the IN operator yourself!
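
As a quick check that IN and the chain of OR conditions select the same rows, here is a sketch using SQLite through Python's sqlite3 module. The kids table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kids (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO kids VALUES (?, ?)",
    [("Ada", 2), ("Ben", 3), ("Cleo", 4), ("Dan", 10), ("Eve", 11)],
)

# Long form: one OR condition per value.
or_rows = conn.execute(
    "SELECT name FROM kids "
    "WHERE age = 2 OR age = 4 OR age = 6 OR age = 8 OR age = 10 "
    "ORDER BY name"
).fetchall()

# Short form: a single IN list.
in_rows = conn.execute(
    "SELECT name FROM kids WHERE age IN (2, 4, 6, 8, 10) ORDER BY name"
).fetchall()

print(in_rows)  # [('Ada',), ('Cleo',), ('Dan',)]
```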
Introduction to NULL and IS NULL
In SQL, NULL represents a missing or unknown value. You can check for NULL values
using the expression IS NULL. For example, to count the number of missing birth dates
in the people table:
SELECT COUNT(*)
FROM people
WHERE birthdate IS NULL;
As you can see, IS NULL is useful when combined with WHERE to figure out what data
you're missing.
Sometimes, you'll want to filter out missing values so you only get results which are
not NULL. To do this, you can use the IS NOT NULL operator.
For example, this query gives the names of all people whose birth dates are not missing
in the people table.
SELECT name
FROM people
WHERE birthdate IS NOT NULL;
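
The sketch below shows why IS NULL is needed: comparing a column to NULL with = never matches, because the comparison itself evaluates to NULL. It assumes SQLite via Python's sqlite3 module, and the people rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "1990-01-01"), ("Bob", None), ("Cy", None)],  # None becomes NULL
)

# Counts the rows whose birthdate is missing.
missing = conn.execute(
    "SELECT COUNT(*) FROM people WHERE birthdate IS NULL"
).fetchone()[0]

# "= NULL" is never true, so no rows are counted.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM people WHERE birthdate = NULL"
).fetchone()[0]

# Only people with a known birthdate.
known = conn.execute(
    "SELECT name FROM people WHERE birthdate IS NOT NULL"
).fetchall()

print(missing, equals_null, known)  # 2 0 [('Ann',)]
```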

LIKE and NOT LIKE


As you've seen, the WHERE clause can be used to filter text data. However, so far you've
only been able to filter by specifying the exact text you're interested in. In the real world,
often you'll want to search for a pattern rather than a specific text string.
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a
column. To accomplish this, you use something called a wildcard as a placeholder for
some other values. There are two wildcards you can use with LIKE:
The % wildcard will match zero, one, or many characters in text. For example, the
following query matches companies like 'Data', 'DataC', 'DataCamp', 'DataMind', and
so on:
SELECT name
FROM companies
WHERE name LIKE 'Data%';
The _ wildcard will match a single character. For example, the following query matches
companies like 'DataCamp', 'DataComp', and so on:
SELECT name
FROM companies
WHERE name LIKE 'DataC_mp';
You can also use the NOT LIKE operator to find records that don't match the pattern you
specify.
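
Here is a small sketch of both wildcards and NOT LIKE, run on SQLite through Python's sqlite3 module as a stand-in for the course database. The company names are invented. One assumption worth flagging: SQLite's LIKE ignores ASCII case by default while PostgreSQL's is case-sensitive, so the sample data sticks to a single capitalisation to keep the results the same in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [("Data",), ("DataC",), ("DataCamp",), ("DataComp",), ("DataMind",), ("BigData",)],
)

def names(pattern_clause):
    """Return the sorted company names matching the given WHERE clause."""
    rows = conn.execute(
        f"SELECT name FROM companies WHERE {pattern_clause} ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

percent_matches = names("name LIKE 'Data%'")        # % = zero or more characters
underscore_matches = names("name LIKE 'DataC_mp'")  # _ = exactly one character
not_like_matches = names("name NOT LIKE 'Data%'")   # everything else

print(percent_matches)
print(underscore_matches)
print(not_like_matches)
```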

Aggregate functions
Often, you will want to perform some calculation on the data in a database. SQL
provides a few functions, called aggregate functions, to help you out with this.

For example,

SELECT AVG(budget)
FROM films;
gives you the average value from the budget column of the films table. Similarly,
the MAX function returns the highest budget:
SELECT MAX(budget)
FROM films;
The SUM function returns the result of adding up the numeric values in a column:
SELECT SUM(budget)
FROM films;
You can probably guess what the MIN function does! Now it's your turn to try out some
SQL functions.

Combining aggregate functions with WHERE
Aggregate functions can be combined with the WHERE clause to gain further insights from
your data.

For example, to get the total budget of movies made in the year 2010 or later:

SELECT SUM(budget)
FROM films
WHERE release_year >= 2010;

Now it's your turn!


A note on arithmetic
In addition to using aggregate functions, you can perform basic arithmetic with symbols
like +, -, *, and /.
So, for example, this gives a result of 12:
SELECT (4 * 3);
However, the following gives a result of 1:
SELECT (4 / 3);

What's going on here?

SQL assumes that if you divide an integer by an integer, you want to get an integer
back. So be careful when dividing!

If you want more precision when dividing, you can add decimal places to your numbers.
For example,

SELECT (4.0 / 3.0) AS result;


gives you the result you would expect: 1.333.
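
The following sketch runs the three expressions above. It assumes SQLite via Python's sqlite3 module, which also truncates integer division; the exact number of decimals shown for 4.0 / 3.0 depends on the database and client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(query):
    """Run a query that returns a single value and return that value."""
    return conn.execute(query).fetchone()[0]

product = scalar("SELECT (4 * 3)")            # integer multiplication
int_division = scalar("SELECT (4 / 3)")       # integer / integer is truncated
real_division = scalar("SELECT (4.0 / 3.0)")  # decimal operands keep the fraction

print(product, int_division, round(real_division, 3))  # 12 1 1.333
```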

It's AS simple AS aliasing


You may have noticed in the first exercise of this chapter that the column name of your
result was just the name of the function you used. For example,

SELECT MAX(budget)
FROM films;
gives you a result with one column, named max. But what if you use two functions like
this?
SELECT MAX(budget), MAX(duration)
FROM films;
Well, then you'd have two columns named max, which isn't very useful!
To avoid situations like this, SQL allows you to do something called aliasing. Aliasing
simply means you assign a temporary name to something. To alias, you use
the AS keyword, which you've already seen earlier in this course.

For example, we could use aliases to make the result of the query above clearer:

SELECT MAX(budget) AS max_budget,
       MAX(duration) AS max_duration
FROM films;
Aliases are helpful for making results more readable!

Even more aliasing


Let's practice your newfound aliasing skills some more before moving on!

Recall: SQL assumes that if you divide an integer by an integer, you want to get an
integer back.

This means that the following will erroneously result in 400.0:


SELECT 45 / 10 * 100.0;
This is because 45 / 10 evaluates to an integer (4), and not a decimal number like we
would expect.

So when you're dividing, make sure at least one operand of the division itself is a decimal, for example by multiplying by the decimal first:

SELECT 45 * 100.0 / 10;


The above now gives the correct answer of 450.0 since the numerator (45 * 100.0) of
the division is now a decimal!
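
A short sketch of the order-of-operations effect, assuming SQLite via Python's sqlite3 module (its integer division truncates in the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 45 / 10 is evaluated first as integer division (4), then multiplied by 100.0.
divide_first = conn.execute("SELECT 45 / 10 * 100.0").fetchone()[0]

# 45 * 100.0 is evaluated first (4500.0), so the division is done on a decimal.
multiply_first = conn.execute("SELECT 45 * 100.0 / 10").fetchone()[0]

print(divide_first, multiply_first)  # 400.0 450.0
```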

ORDER BY
Congratulations on making it this far! You now know how to select and filter your results.

In this chapter you'll learn how to sort and group your results to gain further insight. Let's
go!

In SQL, the ORDER BY keyword is used to sort results in ascending or descending order
according to the values of one or more columns.
By default ORDER BY will sort in ascending order. If you want to sort the results in
descending order, you can use the DESC keyword. For example,
SELECT title
FROM films
ORDER BY release_year DESC;

gives you the titles of films sorted by release year, from newest to oldest.
How do you think ORDER BY sorts a column of text values by default?

Sorting single columns (DESC)


To order results in descending order, you can put the keyword DESC after your ORDER BY.
For example, to get all the names in the people table, in reverse alphabetical order:
SELECT name
FROM people
ORDER BY name DESC;
Now practice using ORDER BY with DESC to sort single columns in descending order!
Sorting multiple columns
ORDER BY can also be used to sort on multiple columns. It will sort by the first column
specified, then sort by the next, then the next, and so on. For example,
SELECT birthdate, name
FROM people
ORDER BY birthdate, name;

sorts on birth dates first (oldest to newest) and then sorts on the names in alphabetical
order. The order of columns is important!

Try using ORDER BY to sort multiple columns! Remember, to specify multiple columns
you separate the column names with a comma.
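
To illustrate why the column order matters, here is a sketch that sorts the same rows two ways. It assumes SQLite via Python's sqlite3 module, stores birth dates as ISO-formatted text (which sorts chronologically), and uses made-up people.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("Zoe", "1990-05-01"),
        ("Bob", "1985-01-15"),
        ("Amy", "1990-05-01"),
        ("Al", "1985-01-15"),
    ],
)

# Birth date first; names only break ties between equal birth dates.
by_date_then_name = conn.execute(
    "SELECT birthdate, name FROM people ORDER BY birthdate, name"
).fetchall()

# Name first; birth date would only break ties between equal names.
by_name_then_date = conn.execute(
    "SELECT birthdate, name FROM people ORDER BY name, birthdate"
).fetchall()

print([name for _, name in by_date_then_name])  # ['Al', 'Bob', 'Amy', 'Zoe']
print([name for _, name in by_name_then_date])  # ['Al', 'Amy', 'Bob', 'Zoe']
```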

GROUP BY
Now you know how to sort results! Often you'll need to aggregate results. For example,
you might want to count the number of male and female employees in your company.
Here, what you want is to group all the males together and count them, and group all
the females together and count them. In SQL, GROUP BY allows you to group a result by
one or more columns, like so:
SELECT sex, count(*)
FROM employees
GROUP BY sex;

This might give, for example:

sex     count
male    15
female  19

Commonly, GROUP BY is used with aggregate functions like COUNT() or MAX(). Note
that GROUP BY always goes after the FROM clause!
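
A minimal sketch of the grouped count, assuming SQLite via Python's sqlite3 module and a hypothetical five-row employees table. An ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, sex TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Ann", "female"),
        ("Bea", "female"),
        ("Carl", "male"),
        ("Dev", "male"),
        ("Eli", "male"),
    ],
)

# One output row per distinct value of sex, with the size of each group.
counts = conn.execute(
    "SELECT sex, COUNT(*) FROM employees GROUP BY sex ORDER BY sex"
).fetchall()

print(counts)  # [('female', 2), ('male', 3)]
```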
HAVING a great time
In SQL, aggregate functions can't be used in WHERE clauses. For example, the following
query is invalid:
SELECT release_year
FROM films
GROUP BY release_year
WHERE COUNT(title) > 10;
This means that if you want to filter based on the result of an aggregate function, you
need another way! That's where the HAVING clause comes in. For example,
SELECT release_year
FROM films
GROUP BY release_year
HAVING COUNT(title) > 10;

shows only those years in which more than 10 films were released.
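
The sketch below contrasts the two clauses on a small, made-up films table, with a threshold of 1 instead of 10 to suit the sample size. It assumes SQLite via Python's sqlite3 module. The rejected query here keeps WHERE in its usual position before GROUP BY, so the only problem is the aggregate inside WHERE; the exact error message is database-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 2000), ("B", 2000), ("C", 2000), ("D", 2001), ("E", 2002), ("F", 2002)],
)

# An aggregate inside WHERE is rejected by the database.
where_error = None
try:
    conn.execute(
        "SELECT release_year FROM films "
        "WHERE COUNT(title) > 1 GROUP BY release_year"
    )
except sqlite3.OperationalError as exc:
    where_error = str(exc)

# HAVING filters the groups after they have been aggregated.
busy_years = conn.execute(
    "SELECT release_year FROM films "
    "GROUP BY release_year HAVING COUNT(title) > 1 "
    "ORDER BY release_year"
).fetchall()

print(where_error is not None)  # True
print(busy_years)               # [(2000,), (2002,)]
```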

Query information_schema with SELECT


information_schema is a meta-database that holds information about your current
database. information_schema has multiple tables you can query with the familiar
SELECT * FROM syntax:

• tables: information about all tables in your current database
• columns: information about all columns in all of the tables in your current database
• ...

In this exercise, you'll only need information from the 'public' schema, which is
specified as the column table_schema of the tables and columns tables.
The 'public' schema holds information about user-defined tables and databases. The
other types of table_schema hold system information – for this course, you're only
interested in user-defined stuff.
Get information on all table names in the current database, while limiting your query to
the 'public' table_schema.

Hint

• This information resides in information_schema.tables.
• Don't forget to set table_schema to 'public'.

Enforce data consistency with attribute constraints
After building a simple database, it's now time to make use of the features. You'll specify data
types in columns, enforce column uniqueness, and disallow NULL values in this chapter.
ADD a COLUMN with ALTER TABLE
Oops! We forgot to add the university_shortname column to the professors table.
You've probably already noticed.

In chapter 4 of this course, you'll need this column for connecting the professors table
with the universities table.

However, adding columns to existing tables is easy, especially if they're still empty.

To add columns you can use the following SQL query:

ALTER TABLE table_name
ADD COLUMN column_name data_type;

RENAME and DROP COLUMNs in affiliations
As mentioned in the video, the still empty affiliations table has some flaws. In this
exercise, you'll correct them as outlined in the video.

You'll use the following queries:

• To rename columns:

ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;

• To delete columns:

ALTER TABLE table_name
DROP COLUMN column_name;

Migrate data with INSERT INTO SELECT DISTINCT
Now it's finally time to migrate the data into the new tables. You'll use the following
pattern:

INSERT INTO ...
SELECT DISTINCT ...
FROM ...;

It can be broken up into two parts:

First part:

SELECT DISTINCT column_name1, column_name2, ...
FROM table_a;

This selects all distinct rows for the listed columns of table_a, which is nothing new for you.

Second part:

INSERT INTO table_b ...;

Put this part in front of the first, so that the combined statement inserts all distinct rows
from table_a into table_b.

One last thing: It is important that you run all of the code at the same time once you
have filled out the blanks.
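
Put together, the pattern might look like the sketch below. It assumes SQLite via Python's sqlite3 module; the column names (university, city) and the rows are hypothetical, chosen only to have duplicates to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (university TEXT, city TEXT)")
conn.execute("CREATE TABLE table_b (university TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [
        ("ETH", "Zurich"),
        ("ETH", "Zurich"),
        ("EPF", "Lausanne"),
        ("EPF", "Lausanne"),
        ("EPF", "Lausanne"),
        ("UZH", "Zurich"),
    ],
)

# INSERT INTO ... SELECT DISTINCT ...: copy each distinct row exactly once.
conn.execute(
    "INSERT INTO table_b (university, city) "
    "SELECT DISTINCT university, city FROM table_a"
)

migrated = conn.execute(
    "SELECT university, city FROM table_b ORDER BY university"
).fetchall()

print(migrated)  # [('EPF', 'Lausanne'), ('ETH', 'Zurich'), ('UZH', 'Zurich')]
```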
Delete tables with DROP TABLE
Obviously, the university_professors table is now no longer needed and can safely be
deleted.

For table deletion, you can use the simple command:

DROP TABLE table_name;

Conforming with data types


For demonstration purposes, I created a fictional database table that only holds three
records. The columns have the data types date, integer, and text, respectively.
CREATE TABLE transactions (
transaction_date date,
amount integer,
fee text
);
Have a look at the contents of the transactions table.
The transaction_date accepts date values. According to the PostgreSQL
documentation, it accepts values in the form of YYYY-MM-DD, DD/MM/YY, and so forth.
Both columns amount and fee appear to be numeric, however, the latter is modeled
as text – which you will account for in the next exercise.

Type CASTs
In the video, you saw that type casts are a possible solution for data type issues. If you
know that a certain column stores numbers as text, you can cast the column to a
numeric type, e.g. integer.
SELECT CAST(some_column AS integer)
FROM table;
Now, the some_column column is temporarily represented as integer instead of text,
meaning that you can perform numeric calculations on the column.
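
As an illustration, the sketch below casts the text column fee of a transactions table shaped like the one above. It assumes SQLite via Python's sqlite3 module, with the date column stored as text and two invented rows. Note that SQLite is more lenient about mixing text and numbers than PostgreSQL, so the sketch inspects the value types with SQLite's typeof() function rather than relying on an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transaction_date TEXT, amount INTEGER, fee TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("2018-09-24", 5454, "30"), ("2018-09-25", 1000, "20")],  # made-up rows
)

# The stored fee is text; the cast value is an integer for this query only.
types = conn.execute(
    "SELECT typeof(fee), typeof(CAST(fee AS integer)) FROM transactions LIMIT 1"
).fetchone()

# With the cast in place, fee can take part in numeric calculations.
total = conn.execute(
    "SELECT SUM(amount + CAST(fee AS integer)) FROM transactions"
).fetchone()[0]

print(types, total)  # ('text', 'integer') 6504
```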

Convert types USING a function


If you don't want to reserve too much space for a certain varchar column, you
can truncate the values before converting its type.
For this, you can use the following syntax:

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(x)
USING SUBSTRING(column_name FROM 1 FOR x);
You should read it like this: Because you want to reserve only x characters
for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters
of it, and throw away the rest. This way, the values will fit the varchar(x) requirement.

Make your columns UNIQUE with ADD CONSTRAINT
As seen in the video, you add the UNIQUE keyword after the name and data type of the
column that should be unique. This, of course, only works for new tables:
CREATE TABLE table_name (
column_name data_type UNIQUE
);

If you want to add a unique constraint to an existing table, you do it like this:

ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
Note that this is different from the ALTER COLUMN syntax for the not-null constraint. Also,
you have to give the constraint a name (some_name).

Get to know SELECT COUNT DISTINCT


Your database doesn't have any defined keys so far, and you don't know which columns
or combinations of columns are suited as keys.

There's a simple way of finding out whether a certain column (or a combination)
contains only unique values – and thus identifies the records in the table.
You already know the SELECT DISTINCT query from the first chapter. Now you just have
to wrap everything within the COUNT() function and PostgreSQL will return the number of
unique rows for the given columns:

Identify keys with SELECT COUNT DISTINCT
There's a very basic way of finding out what qualifies for a key in an existing, populated
table:

1. Count the distinct records for all possible combinations of columns. If the
resulting number x equals the number of all rows in the table for a combination,
you have discovered a superkey.
2. Then remove one column after another until you can no longer remove columns
without seeing the number x decrease. If that is the case, you have discovered a
(candidate) key.
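
The two steps above can be sketched as a brute-force search. This is only an illustration under stated assumptions: it uses SQLite via Python's sqlite3 module, a hypothetical four-row staff table rather than the professors table, and a DISTINCT subquery to count distinct column combinations.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (firstname TEXT, lastname TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "Bern"),
        ("Ann", "Kim", "Bern"),
        ("Bob", "Lee", "Bern"),
        ("Bob", "Kim", "Basel"),
    ],
)
columns = ["firstname", "lastname", "city"]
total_rows = conn.execute("SELECT COUNT(*) FROM staff").fetchone()[0]

def distinct_count(cols):
    """Number of distinct value combinations for the given columns."""
    col_list = ", ".join(cols)  # column names are trusted constants here
    query = f"SELECT COUNT(*) FROM (SELECT DISTINCT {col_list} FROM staff)"
    return conn.execute(query).fetchone()[0]

# Step 1: a combination is a superkey if it is unique across all rows.
superkeys = [
    combo
    for size in range(1, len(columns) + 1)
    for combo in combinations(columns, size)
    if distinct_count(combo) == total_rows
]

# Step 2: a candidate key is a superkey with no smaller superkey inside it.
candidate_keys = [
    key
    for key in superkeys
    if not any(set(other) < set(key) for other in superkeys)
]

print(superkeys)
print(candidate_keys)  # [('firstname', 'lastname')]
```

Trying every combination grows exponentially with the number of columns, so this approach only suits small tables like the ones in these exercises.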
The table professors has 551 rows. It has only one possible candidate key, which is a
combination of two attributes. You might want to try different combinations using the
"Run code" button. Once you have found the solution, you can submit your answer.

ADD key CONSTRAINTs to the tables


Two of the tables in your database already have well-suited candidate keys consisting
of one column each: organizations and universities with
the organization and university_shortname columns, respectively.
In this exercise, you'll rename these columns to id using the RENAME COLUMN command
and then specify primary key constraints for them. This is as straightforward as adding
unique constraints (see the last exercise of Chapter 2):
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name);

Note that you can also specify more than one column in the parentheses.
CONCATenate columns to a surrogate key
Another strategy to add a surrogate key to an existing table is to concatenate existing
columns with the CONCAT() function.

Let's think of the following example table:

CREATE TABLE cars (
make varchar(64) NOT NULL,
model varchar(64) NOT NULL,
mpg integer NOT NULL
);

The table is populated with 10 rows of completely fictional data.

Unfortunately, the table doesn't have a primary key yet. None of the columns consists of
only unique values, so some columns can be combined to form a key.

In the course of the following exercises, you will combine make and model into such a
surrogate key.

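class Solution {
public:
    // Reorders L0 -> L1 -> ... -> Ln into L0 -> Ln -> L1 -> Ln-1 -> ... in place.
    // Assumes a ListNode type with an int constructor and a `next` pointer.
    void reorderList(ListNode *head) {
        if (!head) return;
        ListNode dummy(-1);
        dummy.next = head;

        // Step 1: move p1 to the last node of the first half (p2 moves twice as fast).
        ListNode *p1 = &dummy, *p2 = &dummy;
        for (; p2 && p2->next; p1 = p1->next, p2 = p2->next->next);

        // Step 2: reverse the second half by moving each node right behind p1.
        for (ListNode *prev = p1, *curr = p1->next; curr && curr->next;) {
            ListNode *tmp = curr->next;
            curr->next = curr->next->next;
            tmp->next = prev->next;
            prev->next = tmp;
        }

        // Step 3: cut the list after p1 and interleave the two halves.
        for (p2 = p1->next, p1->next = NULL, p1 = head; p2;) {
            ListNode *tmp = p1->next;
            p1->next = p2;
            p2 = p2->next;
            p1->next->next = tmp;
            p1 = tmp;
        }
    }
};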
Populate the "professor_id" column
Now it's time to also populate professor_id. You'll take the ID directly
from professors.

Here's a way to update columns of a table based on values in another table:

UPDATE table_a
SET column_to_update = table_b.column_to_update_from
FROM table_b
WHERE condition1 AND condition2 AND ...;

This query does the following:

1. For each row in table_a, find the corresponding row in table_b where condition1,
condition2, etc., are met.
2. Set the value of column_to_update to the value of column_to_update_from (from
that corresponding row).

The conditions usually compare other columns of both tables, e.g. table_a.some_column
= table_b.some_column. Of course, this query only makes sense if there is
only one matching row in table_b.

Referential integrity violations


Referential integrity from table A to table B is violated...

• ...if a record in table B that is referenced from a record in table A is deleted.
• ...if a record in table A referencing a non-existing record from table B is inserted.
• Foreign keys prevent violations!

Dealing with violations

CREATE TABLE a (
id integer PRIMARY KEY,
column_a varchar(64),
...,
b_id integer REFERENCES b (id) ON DELETE NO ACTION
);

CREATE TABLE a (
id integer PRIMARY KEY,
column_a varchar(64),
...,
b_id integer REFERENCES b (id) ON DELETE CASCADE
);

Dealing with violations, contd.

ON DELETE...

• ...NO ACTION: Throw an error
• ...CASCADE: Delete all referencing records
• ...RESTRICT: Throw an error
• ...SET NULL: Set the referencing column to NULL
• ...SET DEFAULT: Set the referencing column to its default value

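The NO ACTION and CASCADE behaviours can be tried out with the sketch below. It assumes SQLite via Python's sqlite3 module, where foreign keys are only enforced after PRAGMA foreign_keys = ON (a SQLite-specific step that PostgreSQL does not need). The table names a_no_action and a_cascade and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only: turn on enforcement

conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE a_no_action ("
    "id INTEGER PRIMARY KEY, "
    "b_id INTEGER REFERENCES b (id) ON DELETE NO ACTION)"
)
conn.execute(
    "CREATE TABLE a_cascade ("
    "id INTEGER PRIMARY KEY, "
    "b_id INTEGER REFERENCES b (id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO b VALUES (1), (2)")
conn.execute("INSERT INTO a_no_action VALUES (10, 1)")
conn.execute("INSERT INTO a_cascade VALUES (20, 2)")

# Inserting a reference to a non-existing record in b is rejected.
insert_blocked = False
try:
    conn.execute("INSERT INTO a_cascade VALUES (21, 99)")
except sqlite3.IntegrityError:
    insert_blocked = True

# NO ACTION: deleting a referenced record throws an error.
delete_blocked = False
try:
    conn.execute("DELETE FROM b WHERE id = 1")
except sqlite3.IntegrityError:
    delete_blocked = True

# CASCADE: deleting the referenced record also deletes the referencing ones.
conn.execute("DELETE FROM b WHERE id = 2")
remaining = conn.execute("SELECT COUNT(*) FROM a_cascade").fetchone()[0]

print(insert_blocked, delete_blocked, remaining)  # True True 0
```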